diff options
Diffstat (limited to 'doc/rfc/rfc5893.txt')
| -rw-r--r-- | doc/rfc/rfc5893.txt | 955 | 
1 files changed, 955 insertions, 0 deletions
| diff --git a/doc/rfc/rfc5893.txt b/doc/rfc/rfc5893.txt new file mode 100644 index 0000000..c76dfba --- /dev/null +++ b/doc/rfc/rfc5893.txt @@ -0,0 +1,955 @@ + + + + + + +Internet Engineering Task Force (IETF)                H. Alvestrand, Ed. +Request for Comments: 5893                                        Google +Category: Standards Track                                        C. Karp +ISSN: 2070-1721                        Swedish Museum of Natural History +                                                             August 2010 + + +                       Right-to-Left Scripts for +         Internationalized Domain Names for Applications (IDNA) + +Abstract + +   The use of right-to-left scripts in Internationalized Domain Names +   (IDNs) has presented several challenges.  This memo provides a new +   Bidi rule for Internationalized Domain Names for Applications (IDNA) +   labels, based on the encountered problems with some scripts and some +   shortcomings in the 2003 IDNA Bidi criterion. + +Status of This Memo + +   This is an Internet Standards Track document. + +   This document is a product of the Internet Engineering Task Force +   (IETF).  It represents the consensus of the IETF community.  It has +   received public review and has been approved for publication by the +   Internet Engineering Steering Group (IESG).  Further information on +   Internet Standards is available in Section 2 of RFC 5741. + +   Information about the current status of this document, any errata, +   and how to provide feedback on it may be obtained at +   http://www.rfc-editor.org/info/rfc5893. + +Copyright Notice + +   Copyright (c) 2010 IETF Trust and the persons identified as the +   document authors.  All rights reserved. + +   This document is subject to BCP 78 and the IETF Trust's Legal +   Provisions Relating to IETF Documents +   (http://trustee.ietf.org/license-info) in effect on the date of +   publication of this document.  Please review these documents +   carefully, as they describe your rights and restrictions with respect +   to this document.  Code Components extracted from this document must +   include Simplified BSD License text as described in Section 4.e of +   the Trust Legal Provisions and are provided without warranty as +   described in the Simplified BSD License. + + + + + +Alvestrand & Karp            Standards Track                    [Page 1] + +RFC 5893                   IDNA Right to Left                August 2010 + + +Table of Contents + +   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  2 +     1.1.  Purpose and Applicability  . . . . . . . . . . . . . . . .  2 +     1.2.  Background and History . . . . . . . . . . . . . . . . . .  3 +     1.3.  Structure of the Rest of This Document . . . . . . . . . .  3 +     1.4.  Terminology  . . . . . . . . . . . . . . . . . . . . . . .  4 +   2.  The Bidi Rule  . . . . . . . . . . . . . . . . . . . . . . . .  6 +   3.  The Requirement Set for the Bidi Rule  . . . . . . . . . . . .  6 +   4.  Examples of Issues Found with RFC 3454 . . . . . . . . . . . .  9 +     4.1.  Dhivehi  . . . . . . . . . . . . . . . . . . . . . . . . .  9 +     4.2.  Yiddish  . . . . . . . . . . . . . . . . . . . . . . . . . 10 +     4.3.  Strings with Numbers . . . . . . . . . . . . . . . . . . . 12 +   5.  Troublesome Situations and Guidelines  . . . . . . . . . . . . 12 +   6.  Other Issues in Need of Resolution . . . . . . . . . . . . . . 13 +   7.  Compatibility Considerations . . . . . . . . . . . . . . . . . 14 +     7.1.  Backwards Compatibility Considerations . . . . . . . . . . 14 +     7.2.  Forward Compatibility Considerations . . . . . . . . . . . 15 +   8.  Security Considerations  . . . . . . . . . . . . . . . . . . . 15 +   9.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 16 +   10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 16 +     10.1. Normative References . . . . . . . . . . . . . . . . . . . 16 +     10.2. Informative References . . . . . . . . . . . . . . . . . . 17 + +1.  Introduction + +1.1.  Purpose and Applicability + +   The purpose of this document is to establish a rule that can be +   applied to Internationalized Domain Name (IDN) labels in Unicode form +   (U-labels) containing characters from scripts that are written from +   right to left.  It is part of the revised IDNA protocol [RFC5891]. + +   When labels satisfy the rule, and when certain other conditions are +   satisfied, there is only a minimal chance of these labels being +   displayed in a confusing way by the Unicode bidirectional display +   algorithm. + +   The other normative documents in the IDNA2008 document set establish +   criteria for valid labels, including listing the permitted +   characters.  This document establishes additional validity criteria +   for labels in scripts normally written from right to left. + +   This specification is not intended to place any requirements on +   domain names that do not contain characters from such scripts. + + + + + + +Alvestrand & Karp            Standards Track                    [Page 2] + +RFC 5893                   IDNA Right to Left                August 2010 + + +1.2.  Background and History + +   The "Stringprep" specification [RFC3454], part of IDNA2003, made the +   following statement in its Section 6 on the Bidi algorithm: + +      3) If a string contains any RandALCat character, a RandALCat +      character MUST be the first character of the string, and a +      RandALCat character MUST be the last character of the string. + +   (A RandALCat character is a character with unambiguously +   right-to-left directionality.) + +   The reasoning behind this prohibition was to ensure that every +   component of a displayed domain name has an unambiguously preferred +   direction.  However, this made certain words in languages written +   with right-to-left scripts invalid as IDN labels, and in at least one +   case (Dhivehi) meant that all the words of an entire language were +   forbidden as IDN labels. + +   This is illustrated below with examples taken from the Dhivehi and +   Yiddish languages, as written with the Thaana and Hebrew scripts, +   respectively. + +   RFC 3454 did not explicitly state the requirement to be fulfilled. +   Therefore, it is impossible to determine whether a simple relaxation +   of the rule would continue to fulfill the requirement. + +   While this document specifies rules quite different from RFC 3454, +   most reasonable labels that were allowed under RFC 3454 will also be +   allowed under this specification (the most important example of +   non-permitted labels being labels that mix Arabic and European digits +   (AN and EN) inside an RTL label, and labels that use AN in an LTR +   label -- see Section 1.4 for terminology), so the operational impact +   of using the new rule in the updated IDNA specification is limited. + +1.3.  Structure of the Rest of This Document + +   Section 2 defines a rule, the "Bidi rule", which can be used on a +   domain name label to check how safe it is to use in a domain name of +   possibly mixed directionality.  The primary initial use of this rule +   is as part of the IDNA2008 protocol [RFC5891]. + +   Section 3 sets out the requirements for defining the Bidi rule. + +   Section 4 gives detailed examples that serve as justification for the +   new rule. + + + + + +Alvestrand & Karp            Standards Track                    [Page 3] + +RFC 5893                   IDNA Right to Left                August 2010 + + +   Section 5 to Section 8 describe various situations that can occur +   when dealing with domain names with characters of different +   directionality. + +   Only Section 1.4 and Section 2 are normative. + +1.4.  Terminology + +   The terminology used to describe IDNA concepts is defined in the +   Definitions document [RFC5890]. + +   The terminology used for the Bidi properties of Unicode characters is +   taken from the Unicode Standard [Unicode52]. + +   The Unicode Standard specifies a Bidi property for each character. +   That property controls the character's behavior in the Unicode +   bidirectional algorithm [Unicode-UAX9].  For reference, here are the +   values that the Unicode Bidi property can have: + +   o  L - Left to right - most letters in LTR scripts + +   o  R - Right to left - most letters in non-Arabic RTL scripts + +   o  AL - Arabic letters - most letters in the Arabic script + +   o  EN - European Number (0-9, and Extended Arabic-Indic numbers) + +   o  ES - European Number Separator (+ and -) + +   o  ET - European Number Terminator (currency symbols, the hash sign, +      the percent sign and so on) + +   o  AN - Arabic Number; this encompasses the Arabic-Indic numbers, but +      not the Extended Arabic-Indic numbers + +   o  CS - Common Number Separator (. , / : et al) + +   o  NSM - Nonspacing Mark - most combining accents + +   o  BN - Boundary Neutral - control characters (ZWNJ, ZWJ, and others) + +   o  B - Paragraph Separator + +   o  S - Segment Separator + +   o  WS - Whitespace, including the SPACE character + +   o  ON - Other Neutrals, including @, &, parentheses, MIDDLE DOT + + + +Alvestrand & Karp            Standards Track                    [Page 4] + +RFC 5893                   IDNA Right to Left                August 2010 + + +   o  LRE, LRO, RLE, RLO, PDF - these are "directional control +      characters" and are not used in IDNA labels. + +   In this memo, we use "network order" to describe the sequence of +   characters as transmitted on the wire or stored in a file; the terms +   "first", "next", "previous", "beginning", "end", "before", and +   "after" are used to refer to the relationship of characters and +   labels in network order. + +   We use "display order" to talk about the sequence of characters as +   imaged on a display medium; the terms "left" and "right" are used to +   refer to the relationship of characters and labels in display order. + +   Most of the time, the examples use the abbreviations for the Unicode +   Bidi classes to denote the directionality of the characters; the +   example string CS L consists of one character of class CS and one +   character of class L.  In some examples, the convention that +   uppercase characters are of class R or AL, and lowercase characters +   are of class L is used -- thus, the example string ABC.abc would +   consist of three right-to-left characters and three left-to-right +   characters. + +   The directionality of such examples is determined by context -- for +   instance, in the sentence "ABC.abc is displayed as CBA.abc", the +   first example string is in network order, the second example string +   is in display order. + +   The term "paragraph" is used in the sense of the Unicode Bidi +   specification [Unicode-UAX9].  It means "a block of text that has an +   overall direction, either left to right or right to left", +   approximately; see the "Unicode Bidirectional Algorithm" +   [Unicode-UAX9] for details. + +   "RTL" and "LTR" are abbreviations for "right to left" and "left to +   right", respectively. + +   An RTL label is a label that contains at least one character of type +   R, AL, or AN. + +   An LTR label is any label that is not an RTL label. + +   A "Bidi domain name" is a domain name that contains at least one RTL +   label.  (Note: This definition includes domain names containing only +   dots and right-to-left characters.  Providing a separate category of +   "RTL domain names" would not make this specification simpler, so it +   has not been done.) + + + + + +Alvestrand & Karp            Standards Track                    [Page 5] + +RFC 5893                   IDNA Right to Left                August 2010 + + +2.  The Bidi Rule + +   The following rule, consisting of six conditions, applies to labels +   in Bidi domain names.  The requirements that this rule satisfies are +   described in Section 3.  All of the conditions must be satisfied for +   the rule to be satisfied. + +   1.  The first character must be a character with Bidi property L, R, +       or AL.  If it has the R or AL property, it is an RTL label; if it +       has the L property, it is an LTR label. + +   2.  In an RTL label, only characters with the Bidi properties R, AL, +       AN, EN, ES, CS, ET, ON, BN, or NSM are allowed. + +   3.  In an RTL label, the end of the label must be a character with +       Bidi property R, AL, EN, or AN, followed by zero or more +       characters with Bidi property NSM. + +   4.  In an RTL label, if an EN is present, no AN may be present, and +       vice versa. + +   5.  In an LTR label, only characters with the Bidi properties L, EN, +       ES, CS, ET, ON, BN, or NSM are allowed. + +   6.  In an LTR label, the end of the label must be a character with +       Bidi property L or EN, followed by zero or more characters with +       Bidi property NSM. + +   The following guarantees can be made based on the above: + +   o  In a domain name consisting of only labels that satisfy the rule, +      the requirements of Section 3 are satisfied.  Note that even LTR +      labels and pure ASCII labels have to be tested. + +   o  In a domain name consisting of only LDH labels (as defined in the +      Definitions document [RFC5890]) and labels that satisfy the rule, +      the requirements of Section 3 are satisfied as long as a label +      that starts with an ASCII digit does not come after a +      right-to-left label. + +   No guarantee is given for other combinations. + +3.  The Requirement Set for the Bidi Rule + +   This document, unlike RFC 3454 [RFC3454], provides an explicit +   justification for the Bidi rule, and states a set of requirements for +   which it is possible to test whether or not the modified rule +   fulfills the requirement. + + + +Alvestrand & Karp            Standards Track                    [Page 6] + +RFC 5893                   IDNA Right to Left                August 2010 + + +   All the text in this document assumes that text containing the labels +   under consideration will be displayed using the Unicode bidirectional +   algorithm [Unicode-UAX9]. + +   The requirements proposed are these: + +   o  Label Uniqueness: No two labels, when presented in display order +      in the same paragraph, should have the same sequence of characters +      without also having the same sequence of characters in network +      order, both when the paragraph has LTR direction and when the +      paragraph has RTL direction.  (This is the criterion that is +      explicit in RFC 3454).  (Note that a label displayed in an RTL +      paragraph may display the same as a different label displayed in +      an LTR paragraph and still satisfy this criterion.) + +   o  Character Grouping: When displaying a string of labels, using the +      Unicode Bidi algorithm to reorder the characters for display, the +      characters of each label should remain grouped between the +      characters delimiting the labels, both when the string is embedded +      in a paragraph with LTR direction and when it is embedded in a +      paragraph with RTL direction. + +   Several stronger statements were considered and rejected, because +   they seem to be impossible to fulfill within the constraints of the +   Unicode bidirectional algorithm.  These include: + +   o  The appearance of a label should be unaffected by its embedding +      context.  This proved impossible even for ASCII labels; the label +      "123-A" will have a different display order in an RTL context than +      in an LTR context.  (This particular example is, however, +      disallowed anyway.) + +   o  The sequence of labels should be consistent with network order. +      This proved impossible -- a domain name consisting of the labels +      (in network order) L1.R2.R3.L4 will be displayed as L1.R3.R2.L4 in +      an LTR context.  (In an RTL context, it will be displayed as +      L4.R3.R2.L1). + +   o  No two domain names should be displayed the same, even under +      differing directionality.  This was shown to be unsound, since the +      domain name (in network order) ABC.abc will have display order +      CBA.abc in an LTR context and abc.CBA in an RTL context, while the +      domain name (network) abc.ABC will have display order abc.CBA in +      an LTR context and CBA.abc in an RTL context. + + + + + + + +Alvestrand & Karp            Standards Track                    [Page 7] + +RFC 5893                   IDNA Right to Left                August 2010 + + +   One possible requirement was thought to be problematic, but turned +   out to be satisfied by a string that obeys the proposed rules: + +   o  The Character Grouping requirement should be satisfied when +      directional controls (LRE, RLE, RLO, LRO, PDF) are used in the +      same paragraph (outside of the labels).  Because these controls +      affect presentation order in non-obvious ways, by affecting the +      "sor" and "eor" properties of the Unicode Bidi algorithm, the +      conditions above require extra testing in order to figure out +      whether or not they influence the display of the domain name. +      Testing found that for the strings allowed under the rule +      presented in this document, directional controls do not influence +      the display of the domain name. + +   This is still not stated as a requirement, since it did not seem as +   important as the stated requirements, but it is useful to know that +   Bidi domain names where the labels satisfy the rule have this +   property. + +   In the following descriptions, first-level bullets are used to +   indicate rules or normative statements; second-level bullets are +   commentary. + +   The Character Grouping requirement can be more formally stated as: + +   o  Let "Delimiterchars" be a set of characters with the Unicode Bidi +      properties CS, WS, ON.  (These are commonly used to delimit labels +      -- both the FULL STOP and the space are included.  They are not +      allowed in domain labels.) + +      *  ET, though it commonly occurs next to domain names in practice, +         is problematic: the context R CS L EN ET (for instance A.a1%) +         makes the label L EN not satisfy the character grouping +         requirement. + +      *  ES commonly occurs in labels as HYPHEN-MINUS, but could also be +         used as a delimiter (for instance, the plus sign).  It is left +         out here. + +   o  Let "unproblematic label" be a label that either satisfies the +      requirements or does not contain any character with the Bidi +      properties R, AL, or AN and does not begin with a character with +      the Bidi property EN.  (Informally, "it does not start with a +      number".) + + + + + + + +Alvestrand & Karp            Standards Track                    [Page 8] + +RFC 5893                   IDNA Right to Left                August 2010 + + +   A label X satisfies the Character Grouping requirement when, for any +   Delimiter Character D1 and D2, and for any label S1 and S2 that is an +   unproblematic label or an empty string, the following holds true: + +   If the string formed by concatenating S1, D1, X, D2, and S2 is +   reordered according to the Bidi algorithm, then all the characters of +   X in the reordered string are between D1 and D2, and no other +   characters are between D1 and D2, both if the overall paragraph +   direction is LTR and if the overall paragraph direction is RTL. + +   Note that the definition is self-referential, since S1 and S2 are +   constrained to be "legal" by this definition.  This makes testing +   changes to proposed rules a little complex, but does not create +   problems for testing whether or not a given proposed rule satisfies +   the criterion. + +   The "zero-length" case represents the case where a domain name is +   next to something that isn't a domain name, separated by a delimiter +   character. + +   Note about the position of BN: The Unicode bidirectional algorithm +   specifies that a BN has an effect on the adjoining characters in +   network order, not in display order, and are therefore treated as if +   removed during Bidi processing ([Unicode-UAX9], Section 3.3.2, rule +   X9 and Section 5.3).  Therefore, the question of "what position does +   a BN have after reordering" is not meaningful.  It has been ignored +   while developing the rules here. + +   The Label Uniqueness requirement can be formally stated as: + +   If two non-identical labels X and Y, embedded as for the test above, +   displayed in paragraphs with the same directionality, are reordered +   by the Bidi algorithm into the same sequence of code points, the +   labels X and Y cannot both be legal. + +4.  Examples of Issues Found with RFC 3454 + +4.1.  Dhivehi + +   Dhivehi, the official language of the Maldives, is written with the +   Thaana script.  This script displays some of the characteristics of +   the Arabic script, including its directional properties, and the +   indication of vowels by the diacritical marking of consonantal base +   characters.  This marking is obligatory, and both two consecutive +   vowels and syllable-final consonants are indicated with unvoiced +   combining marks.  Every Dhivehi word therefore ends with a combining +   mark. + + + + +Alvestrand & Karp            Standards Track                    [Page 9] + +RFC 5893                   IDNA Right to Left                August 2010 + + +   The word for "computer", which is romanized as "konpeetaru", is +   written with the following sequence of Unicode code points: + +      U+0786 THAANA LETTER KAAFU (AL) + +      U+07AE THAANA OBOFILI (NSM) + +      U+0782 THAANA LETTER NOONU (AL) + +      U+07B0 THAANA SUKUN (NSM) + +      U+0795 THAANA LETTER PAVIYANI (AL) + +      U+07A9 THAANA LETTER EEBEEFILI (AL) + +      U+0793 THAANA LETTER TAVIYANI (AL) + +      U+07A6 THAANA ABAFILI (NSM) + +      U+0783 THAANA LETTER RAA (AL) + +      U+07AA THAANA UBUFILI (NSM) + +   The directionality class of U+07AA in the Unicode database +   [Unicode52] is NSM (Nonspacing Mark), which is not R or AL; a +   conformant implementation of the IDNA2003 algorithm will say that +   "this is not in RandALCat" and refuse to encode the string. + +4.2.  Yiddish + +   Yiddish is one of several languages written with the Hebrew script +   (others include Hebrew and Ladino).  This is basically a consonantal +   alphabet (also termed an "abjad"), but Yiddish is written using an +   extended form that is fully vocalic.  The vowels are indicated in +   several ways, one of which is by repurposing letters that are +   consonants in Hebrew.  Other letters are used both as vowels and +   consonants, with combining marks, called "points", used to +   differentiate between them.  Finally, some base characters can +   indicate several different vowels, which are also disambiguated by +   combining marks.  Pointed characters can appear in word-final +   position and may therefore also be needed at the end of labels.  This +   is not an invariable attribute of a Yiddish string and there is thus +   greater latitude here than there is with Dhivehi. + +   The organization now known as the "YIVO Institute for Jewish +   Research" developed orthographic rules for modern Standard Yiddish +   during the 1930s on the basis of work conducted in several venues +   since earlier in that century.  These are given in, "The Standardized + + + +Alvestrand & Karp            Standards Track                   [Page 10] + +RFC 5893                   IDNA Right to Left                August 2010 + + +   Yiddish Orthography: Rules of Yiddish Spelling" [SYO], and are taken +   as normatively descriptive of modern Standard Yiddish in any context +   where that notion is deemed relevant.  They have been applied +   exclusively in all formal Yiddish dictionaries published since their +   establishment, and are similarly dominant in academic and +   bibliographic regards. + +   It therefore appears appropriate for this repertoire also to be +   supported fully by IDNA.  This presents no difficulty with characters +   in initial and medial positions, but pointed characters are regularly +   used in final position as well.  All of the characters in the SYO +   repertoire appear in both marked and unmarked form with one +   exception: the HEBREW LETTER PE (U+05E4).  The SYO only permits this +   with a HEBREW POINT DAGESH (U+05BC), providing the Yiddish equivalent +   to the Latin letter "p", or a HEBREW POINT RAFE (U+05BF), equivalent +   to the Latin letter "f".  There is, however, a separate unpointed +   allograph, the HEBREW LETTER FINAL PE (U+05E3), for the latter +   character when it appears in final position.  The constraint on the +   use of the SYO repertoire resulting from the proscription of +   combining marks at the end of RTL strings thus reduces to nothing +   more, or less, than the equivalent of saying that a string of Latin +   characters cannot end with the letter "p".  It must also be noted +   that the HEBREW LETTER PE with the HEBREW POINT DAGESH is +   characteristic of almost all traditional Yiddish orthographies that +   predate (or remain in use in parallel to) the SYO, being the first +   pointed character to appear in any of them. + +   A more general instantiation of the basic problem can be seen in the +   representation of the YIVO acronym.  This acronym is written with the +   Hebrew letters YOD YOD HIRIQ VAV VAV ALEF QAMATS, where HIRIQ and +   QAMATS are combining points.  The Unicode code points are: + +      U+05D9 HEBREW LETTER YOD (R) + +      U+05B4 HEBREW POINT HIRIQ (NSM) + +      U+05D5 HEBREW LETTER VAV (R) + +      U+05D0 HEBREW LETTER ALEF (R) + +      U+05B8 HEBREW POINT QAMATS (NSM) + +   The directionality class of U+05B8 HEBREW POINT QAMATS in the Unicode +   database is NSM, which again causes the IDNA2003 algorithm to reject +   the string. + + + + + + +Alvestrand & Karp            Standards Track                   [Page 11] + +RFC 5893                   IDNA Right to Left                August 2010 + + +   It may also be noted that all of the combined characters mentioned +   above exist in precomposed form at separate positions in the Unicode +   chart.  However, by invoking Stringprep, the IDNA2003 algorithm also +   rejects those code points, for reasons not discussed here. + +4.3.  Strings with Numbers + +   By requiring that the first or last character of a string be a member +   of category R or AL, the Stringprep specification [RFC3454] +   prohibited a string containing right-to-left characters from ending +   with a number. + +   Consider the strings ALEF 5 (HEBREW LETTER ALEF + DIGIT FIVE) and 5 +   ALEF.  Displayed in an LTR context, the first one will be displayed +   from left to right as 5 ALEF (with the 5 being considered right to +   left because of the leading ALEF), while 5 ALEF will be displayed in +   exactly the same order (5 taking the direction from context). +   Clearly, only one of those should be permitted as a registered label, +   but barring them both seems unnecessary. + +5.  Troublesome Situations and Guidelines + +   There are situations in which labels that satisfy the rule above will +   be displayed in a surprising fashion.  The most important of these is +   the case where a label ending in a character with Bidi property AL, +   AN, or R occurs before a label beginning with a character of Bidi +   property EN.  In that case, the number will appear to move into the +   label containing the right-to-left character, violating the Character +   Grouping requirement. + +   If the label that occurs after the right-to-left label itself +   satisfies the Bidi criterion, the requirements will be satisfied in +   all cases (this is the reason why the criterion talks about strings +   containing L in some cases).  However, the IDNABIS WG concluded that +   this could not be required for several reasons: + +   o  There is a large current deployment of ASCII domain names starting +      with digits.  These cannot possibly be invalidated. + +   o  Domain names are often constructed piecemeal, for instance, by +      combining a string with the content of a search list.  This may +      occur after IDNA processing, and thus in part of the code that is +      not IDNA-aware, making detection of the undesirable combination +      impossible. + + + + + + + +Alvestrand & Karp            Standards Track                   [Page 12] + +RFC 5893                   IDNA Right to Left                August 2010 + + +   o  Even if a label is registered under a "safe" label, there may be a +      DNAME [RFC2672] with an "unsafe" label that points to the "safe" +      label, thus creating seemingly valid names that would not satisfy +      the criterion. + +   o  Wildcards create the odd situation where a label is "valid" (can +      be looked up successfully) without the zone owner knowing that +      this label exists.  So an owner of a zone whose name starts with a +      digit and contains a wildcard has no way of controlling whether or +      not names with RTL labels in them are looked up in his zone. + +   Rather than trying to suggest rules that disallow all such +   undesirable situations, this document merely warns about the +   possibility, and leaves it to application developers to take whatever +   measures they deem appropriate to avoid problematic situations. + +6.  Other Issues in Need of Resolution + +   This document concerns itself only with the rules that are needed +   when dealing with domain names with characters that have differing +   Bidi properties, and considers characters only in terms of their Bidi +   properties.  All other issues with scripts that are written from +   right to left must be considered in other contexts. + +   One such issue is the need to keep numbers separate.  Several scripts +   are used with multiple sets of numbers -- most commonly they use +   Latin numbers and a script-specific set of numbers, but in the case +   of Arabic, there are two sets of "Arabic-Indic" digits involved. + +   The algorithm in this document disallows occurrences of AN-class +   characters ("Arabic-Indic digits", U+0660 to U+0669) together with +   EN-class characters (which includes "European" digits, U+0030 to +   U+0039 and "extended Arabic-Indic digits", U+06F0 to U+06F9), but +   does not help in preventing the mixing of, for instance, Bengali +   digits (U+09E6 to U+09EF) and Gujarati digits (U+0AE6 to U+0AEF), +   both of which have Bidi class L.  A registry or script community that +   wishes to create rules restricting the mixing of digits in a label +   will be able to specify these restrictions at the registry level. +   Some rules are also specified at the protocol level. + +   Another set of issues concerns the proper display of IDNs with a +   mixture of LTR and RTL labels, or only RTL labels. + +   It is unrealistic to expect that applications will display domain +   names using embedded formatting codes between their labels (for one +   thing, no reliable algorithms for identifying domain names in running +   text exist); thus, the display order will be determined by the Bidi +   algorithm.  Thus, a sequence (in network order) of R1.R2.ltr will be + + + +Alvestrand & Karp            Standards Track                   [Page 13] + +RFC 5893                   IDNA Right to Left                August 2010 + + +   displayed in the order 2R.1R.ltr in an LTR context, which might +   surprise someone expecting to see labels displayed in hierarchical +   order.  People used to working with text that mixes LTR and RTL +   strings might not be so surprised by this.  Again, this memo does not +   attempt to suggest a solution to this problem. + +7.  Compatibility Considerations + +7.1.  Backwards Compatibility Considerations + +   As with any change to an existing standard, it is important to +   consider what happens with existing implementations when the change +   is introduced.  Some troublesome cases include: + +   o  An old program used to input the newly allowed label.  If the old +      program checks the input against RFC 3454, some labels will not be +      allowed, and domain names containing those labels will remain +      inaccessible. + +   o  An old program is asked to display the newly allowed label, and +      checks it against RFC 3454 before displaying.  The program will +      perform some kind of fallback, most likely displaying the label in +      A-label form. + +   o  An old program tries to display the newly allowed label.  If the +      old program has code for displaying the last character of a label +      that is different from the code used to display the characters in +      the middle of the label, the display may be inconsistent and cause +      confusion. + +   One particular example of the last case is if a program chooses to +   examine the last character (in network order) of a string in order to +   determine its directionality, rather than its first.  If it finds an +   NSM character and tries to display the string as if it was a +   left-to-right string, the resulting display may be interesting, but +   not useful. + +   The editors believe that these cases will have a less harmful impact +   in practice than continuing to deny the use of words from the +   languages for which these strings are necessary as IDN labels. + +   This specification does not forbid using leading European digits in +   ASCII-only labels, since this would conflict with a large installed +   base of such labels, and would increase the scope of the +   specification from RTL labels to all labels.  The harm resulting from +   this limitation of scope is described in Section 5.  Registries and +   private zone managers can check for this particular condition before +   they allow registration of any RTL label.  Generally, it is best to + + + +Alvestrand & Karp            Standards Track                   [Page 14] + +RFC 5893                   IDNA Right to Left                August 2010 + + +   disallow registration of any right-to-left strings in a zone where +   the label at the level above begins with a digit. + +7.2.  Forward Compatibility Considerations + +   This text is intentionally specified strictly in terms of the Unicode +   Bidi properties.  The determination that the condition is sufficient +   to fulfill the criteria depends on the Unicode Bidi algorithm; it is +   unlikely that drastic changes will be made to this algorithm. + +   However, the determination of validity for any string depends on the +   Unicode Bidi property values, which are not declared immutable by the +   Unicode Consortium.  Furthermore, the behavior of the algorithm for +   any given character is likely to be linguistically and culturally +   sensitive, so while it should occur rarely, it is possible that later +   versions of the Unicode Standard may change the Bidi properties +   assigned to certain Unicode characters. + +   This memo does not propose a solution for this problem. + +8.  Security Considerations + +   The display behavior of mixed-direction text can be extremely +   surprising to users who are not used to it; for instance, cut and +   paste of a piece of text can cause the text to display differently at +   the destination, if the destination is in another directionality +   context, and adding a character in one place of a text can cause +   characters some distance from the point of insertion to change their +   display position.  This is, however, not a phenomenon unique to the +   display of domain names. + +   The new IDNA protocol, and particularly these new Bidi rules, will +   allow some strings to be used in IDNA contexts that are not allowed +   today.  It is possible that differences in the interpretation of +   labels between implementations of IDNA2003 and IDNA2008 could pose a +   security risk, but it is difficult to envision any specific +   instantiation of this. + +   Any rational attempt to compute, for instance, a hash over an +   identifier processed by IDNA would use network order for its +   computation, and thus be unaffected by the new rules proposed here. + +   While it is not believed to pose a problem, if display routines had +   been written with specific knowledge of the RFC 3454 IDNA +   prohibitions, it is possible that the potential problems noted under +   "Backwards Compatibility Considerations" could cause new kinds of +   confusion. + + + + +Alvestrand & Karp            Standards Track                   [Page 15] + +RFC 5893                   IDNA Right to Left                August 2010 + + +9.  Acknowledgements + +   While the listed editors held the pen, this document represents the +   joint work and conclusions of an ad hoc design team.  In addition to +   the editors, this consisted of, in alphabetic order, Tina Dam, Patrik +   Faltstrom, and John Klensin.  Many further specific contributions and +   helpful comments were received from the people listed below, and +   others who have contributed to the development and use of the IDNA +   protocols. + +   The particular formulation of the Bidi rule in Section 2 was +   suggested by Matitiahu Allouche. + +   The team wishes, in particular, to thank Roozbeh Pournader for +   calling its attention to the issue with the Thaana script, Paul +   Hoffman for pointing out the need to be explicit about backwards +   compatibility considerations, Ken Whistler for suggesting the basis +   of the formalized "Character Grouping" requirement, Mark Davis for +   commentary, Erik van der Poel for careful review, comments, and +   verification of the rulesets, Marcos Sanz, Andrew Sullivan, and Pete +   Resnick for reviews, and Vint Cerf for chairing the working group and +   contributing massively to getting the documents finished. + +10.  References + +10.1.  Normative References + +   [RFC5890]      Klensin, J., "Internationalized Domain Names for +                  Applications (IDNA): Definitions and Document +                  Framework", RFC 5890, August 2010. + +   [Unicode-UAX9] The Unicode Consortium, "Unicode Standard Annex #9: +                  Unicode Bidirectional Algorithm", September 2009, +                  <http://www.unicode.org/reports/tr9/>. + +   [Unicode52]    The Unicode Consortium.  The Unicode Standard, Version +                  5.2.0, defined by: "The Unicode Standard, Version +                  5.2.0", (Mountain View, CA: The Unicode Consortium, +                  2009. ISBN 978-1-936213-00-9). +                  <http://www.unicode.org/versions/Unicode5.2.0/>. + + + + + + + + + + + +Alvestrand & Karp            Standards Track                   [Page 16] + +RFC 5893                   IDNA Right to Left                August 2010 + + +10.2.  Informative References + +   [RFC2672]      Crawford, M., "Non-Terminal DNS Name Redirection", +                  RFC 2672, August 1999. + +   [RFC3454]      Hoffman, P. and M. Blanchet, "Preparation of +                  Internationalized Strings ("stringprep")", RFC 3454, +                  December 2002. + +   [RFC5891]      Klensin, J., "Internationalized Domain Names in +                  Applications (IDNA): Protocol", RFC 5891, August 2010. + +   [SYO]          "The Standardized Yiddish Orthography: Rules of +                  Yiddish Spelling, 6th ed., New York, ISBN +                  0-914512-25-0", 1999. + +Authors' Addresses + +   Harald Tveit Alvestrand (editor) +   Google +   Beddingen 10 +   Trondheim,   7014 +   Norway + +   EMail: harald@alvestrand.no + + +   Cary Karp +   Swedish Museum of Natural History +   Frescativ. 40 +   Stockholm,   10405 +   Sweden + +   Phone: +46 8 5195 4055 +   Fax: +   EMail: ck@nic.museum + + + + + + + + + + + + + + + +Alvestrand & Karp            Standards Track                   [Page 17] + |