author Thomas Voss <mail@thomasvoss.com> 2024-11-27 20:54:24 +0100
committer Thomas Voss <mail@thomasvoss.com> 2024-11-27 20:54:24 +0100
commit 4bfd864f10b68b71482b35c818559068ef8d5797
tree e3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc5893.txt
parent ea76e11061bda059ae9f9ad130a9895cc85607db
doc: Add RFC documents
Diffstat (limited to 'doc/rfc/rfc5893.txt')
-rw-r--r-- doc/rfc/rfc5893.txt 955
1 files changed, 955 insertions, 0 deletions
diff --git a/doc/rfc/rfc5893.txt b/doc/rfc/rfc5893.txt
new file mode 100644
index 0000000..c76dfba
--- /dev/null
+++ b/doc/rfc/rfc5893.txt
@@ -0,0 +1,955 @@
+
+
+
+
+
+
+Internet Engineering Task Force (IETF) H. Alvestrand, Ed.
+Request for Comments: 5893 Google
+Category: Standards Track C. Karp
+ISSN: 2070-1721 Swedish Museum of Natural History
+ August 2010
+
+
+ Right-to-Left Scripts for
+ Internationalized Domain Names for Applications (IDNA)
+
+Abstract
+
+ The use of right-to-left scripts in Internationalized Domain Names
+ (IDNs) has presented several challenges. This memo provides a new
+ Bidi rule for Internationalized Domain Names for Applications (IDNA)
+ labels, based on the encountered problems with some scripts and some
+ shortcomings in the 2003 IDNA Bidi criterion.
+
+Status of This Memo
+
+ This is an Internet Standards Track document.
+
+ This document is a product of the Internet Engineering Task Force
+ (IETF). It represents the consensus of the IETF community. It has
+ received public review and has been approved for publication by the
+ Internet Engineering Steering Group (IESG). Further information on
+ Internet Standards is available in Section 2 of RFC 5741.
+
+ Information about the current status of this document, any errata,
+ and how to provide feedback on it may be obtained at
+ http://www.rfc-editor.org/info/rfc5893.
+
+Copyright Notice
+
+ Copyright (c) 2010 IETF Trust and the persons identified as the
+ document authors. All rights reserved.
+
+ This document is subject to BCP 78 and the IETF Trust's Legal
+ Provisions Relating to IETF Documents
+ (http://trustee.ietf.org/license-info) in effect on the date of
+ publication of this document. Please review these documents
+ carefully, as they describe your rights and restrictions with respect
+ to this document. Code Components extracted from this document must
+ include Simplified BSD License text as described in Section 4.e of
+ the Trust Legal Provisions and are provided without warranty as
+ described in the Simplified BSD License.
+
+
+
+
+
+Alvestrand & Karp Standards Track [Page 1]
+
+RFC 5893 IDNA Right to Left August 2010
+
+
+Table of Contents
+
+ 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 2
+ 1.1. Purpose and Applicability . . . . . . . . . . . . . . . . 2
+ 1.2. Background and History . . . . . . . . . . . . . . . . . . 3
+ 1.3. Structure of the Rest of This Document . . . . . . . . . . 3
+ 1.4. Terminology . . . . . . . . . . . . . . . . . . . . . . . 4
+ 2. The Bidi Rule . . . . . . . . . . . . . . . . . . . . . . . . 6
+ 3. The Requirement Set for the Bidi Rule . . . . . . . . . . . . 6
+ 4. Examples of Issues Found with RFC 3454 . . . . . . . . . . . . 9
+ 4.1. Dhivehi . . . . . . . . . . . . . . . . . . . . . . . . . 9
+ 4.2. Yiddish . . . . . . . . . . . . . . . . . . . . . . . . . 10
+ 4.3. Strings with Numbers . . . . . . . . . . . . . . . . . . . 12
+ 5. Troublesome Situations and Guidelines . . . . . . . . . . . . 12
+ 6. Other Issues in Need of Resolution . . . . . . . . . . . . . . 13
+ 7. Compatibility Considerations . . . . . . . . . . . . . . . . . 14
+ 7.1. Backwards Compatibility Considerations . . . . . . . . . . 14
+ 7.2. Forward Compatibility Considerations . . . . . . . . . . . 15
+ 8. Security Considerations . . . . . . . . . . . . . . . . . . . 15
+ 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 16
+ 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 16
+ 10.1. Normative References . . . . . . . . . . . . . . . . . . . 16
+ 10.2. Informative References . . . . . . . . . . . . . . . . . . 17
+
+1. Introduction
+
+1.1. Purpose and Applicability
+
+ The purpose of this document is to establish a rule that can be
+ applied to Internationalized Domain Name (IDN) labels in Unicode form
+ (U-labels) containing characters from scripts that are written from
+ right to left. It is part of the revised IDNA protocol [RFC5891].
+
+ When labels satisfy the rule, and when certain other conditions are
+ satisfied, there is only a minimal chance of these labels being
+ displayed in a confusing way by the Unicode bidirectional display
+ algorithm.
+
+ The other normative documents in the IDNA2008 document set establish
+ criteria for valid labels, including listing the permitted
+ characters. This document establishes additional validity criteria
+ for labels in scripts normally written from right to left.
+
+ This specification is not intended to place any requirements on
+ domain names that do not contain characters from such scripts.
+
+1.2. Background and History
+
+ The "Stringprep" specification [RFC3454], part of IDNA2003, made the
+ following statement in its Section 6 on the Bidi algorithm:
+
+ 3) If a string contains any RandALCat character, a RandALCat
+ character MUST be the first character of the string, and a
+ RandALCat character MUST be the last character of the string.
+
+ (A RandALCat character is a character with unambiguously
+ right-to-left directionality.)
+
+ The reasoning behind this prohibition was to ensure that every
+ component of a displayed domain name has an unambiguously preferred
+ direction. However, this made certain words in languages written
+ with right-to-left scripts invalid as IDN labels, and in at least one
+ case (Dhivehi) meant that all the words of an entire language were
+ forbidden as IDN labels.
+
+ This is illustrated below with examples taken from the Dhivehi and
+ Yiddish languages, as written with the Thaana and Hebrew scripts,
+ respectively.
+
+ RFC 3454 did not explicitly state the requirement to be fulfilled.
+ Therefore, it is impossible to determine whether a simple relaxation
+ of the rule would continue to fulfill the requirement.
+
+ While this document specifies rules quite different from those in
+ RFC 3454, most reasonable labels that were allowed under RFC 3454 are
+ also allowed under this specification, so the operational impact of
+ using the new rule in the updated IDNA specification is limited. The
+ most important labels that are no longer permitted are those that mix
+ Arabic and European digits (AN and EN) inside an RTL label and those
+ that use AN in an LTR label (see Section 1.4 for terminology).
+
+1.3. Structure of the Rest of This Document
+
+ Section 2 defines a rule, the "Bidi rule", which can be used on a
+ domain name label to check how safe it is to use in a domain name of
+ possibly mixed directionality. The primary initial use of this rule
+ is as part of the IDNA2008 protocol [RFC5891].
+
+ Section 3 sets out the requirements for defining the Bidi rule.
+
+ Section 4 gives detailed examples that serve as justification for the
+ new rule.
+
+ Section 5 to Section 8 describe various situations that can occur
+ when dealing with domain names with characters of different
+ directionality.
+
+ Only Section 1.4 and Section 2 are normative.
+
+1.4. Terminology
+
+ The terminology used to describe IDNA concepts is defined in the
+ Definitions document [RFC5890].
+
+ The terminology used for the Bidi properties of Unicode characters is
+ taken from the Unicode Standard [Unicode52].
+
+ The Unicode Standard specifies a Bidi property for each character.
+ That property controls the character's behavior in the Unicode
+ bidirectional algorithm [Unicode-UAX9]. For reference, here are the
+ values that the Unicode Bidi property can have:
+
+ o L - Left to right - most letters in LTR scripts
+
+ o R - Right to left - most letters in non-Arabic RTL scripts
+
+ o AL - Arabic letters - most letters in the Arabic script
+
+ o EN - European Number (0-9, and Extended Arabic-Indic numbers)
+
+ o ES - European Number Separator (+ and -)
+
+ o ET - European Number Terminator (currency symbols, the hash sign,
+ the percent sign and so on)
+
+ o AN - Arabic Number; this encompasses the Arabic-Indic numbers, but
+ not the Extended Arabic-Indic numbers
+
+ o CS - Common Number Separator (. , / : et al)
+
+ o NSM - Nonspacing Mark - most combining accents
+
+ o BN - Boundary Neutral - control characters (ZWNJ, ZWJ, and others)
+
+ o B - Paragraph Separator
+
+ o S - Segment Separator
+
+ o WS - Whitespace, including the SPACE character
+
+ o ON - Other Neutrals, including @, &, parentheses, MIDDLE DOT
+
+ o LRE, LRO, RLE, RLO, PDF - these are "directional control
+ characters" and are not used in IDNA labels.
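+
+ As an illustrative sketch (not part of the specification), the Bidi
+ property of a character can be looked up with the unicodedata module
+ from the Python standard library. The helper name bidi_classes is
+ hypothetical, and the values returned come from the Unicode version
+ bundled with the Python interpreter, which may differ from
+ [Unicode52].
+
```python
import unicodedata


def bidi_classes(text):
    """Return the Bidi class abbreviation of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


if __name__ == "__main__":
    # A few characters from the classes listed above.
    samples = ["a", "\u05d0", "\u0627", "5", "\u06f5", "\u0665",
               "-", "%", ".", "\u200c", "@"]
    for ch in samples:
        print("U+%04X" % ord(ch), unicodedata.name(ch, "?"),
              bidi_classes(ch)[0])
```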
+
+ In this memo, we use "network order" to describe the sequence of
+ characters as transmitted on the wire or stored in a file; the terms
+ "first", "next", "previous", "beginning", "end", "before", and
+ "after" are used to refer to the relationship of characters and
+ labels in network order.
+
+ We use "display order" to talk about the sequence of characters as
+ imaged on a display medium; the terms "left" and "right" are used to
+ refer to the relationship of characters and labels in display order.
+
+ Most of the time, the examples use the abbreviations for the Unicode
+ Bidi classes to denote the directionality of the characters; the
+ example string CS L consists of one character of class CS and one
+ character of class L. In some examples, the convention that
+ uppercase characters are of class R or AL, and lowercase characters
+ are of class L is used -- thus, the example string ABC.abc would
+ consist of three right-to-left characters and three left-to-right
+ characters.
+
+ The directionality of such examples is determined by context -- for
+ instance, in the sentence "ABC.abc is displayed as CBA.abc", the
+ first example string is in network order, the second example string
+ is in display order.
+
+ The term "paragraph" is used in the sense of the Unicode Bidi
+ specification [Unicode-UAX9]. It means "a block of text that has an
+ overall direction, either left to right or right to left",
+ approximately; see the "Unicode Bidirectional Algorithm"
+ [Unicode-UAX9] for details.
+
+ "RTL" and "LTR" are abbreviations for "right to left" and "left to
+ right", respectively.
+
+ An RTL label is a label that contains at least one character of type
+ R, AL, or AN.
+
+ An LTR label is any label that is not an RTL label.
+
+ A "Bidi domain name" is a domain name that contains at least one RTL
+ label. (Note: This definition includes domain names containing only
+ dots and right-to-left characters. Providing a separate category of
+ "RTL domain names" would not make this specification simpler, so it
+ has not been done.)
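+
+ The three definitions above can be sketched in Python as follows.
+ The function names are hypothetical, and splitting a domain name on
+ U+002E FULL STOP only is a simplifying assumption of this sketch.
+
```python
import unicodedata

# Bidi classes whose presence makes a label an RTL label.
RTL_CLASSES = {"R", "AL", "AN"}


def is_rtl_label(label):
    """True if the label contains at least one R, AL, or AN character."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in label)


def is_ltr_label(label):
    """An LTR label is any label that is not an RTL label."""
    return not is_rtl_label(label)


def is_bidi_domain_name(name):
    """True if at least one label of the domain name is an RTL label."""
    return any(is_rtl_label(label) for label in name.split("."))
```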
+
+2. The Bidi Rule
+
+ The following rule, consisting of six conditions, applies to labels
+ in Bidi domain names. The requirements that this rule satisfies are
+ described in Section 3. All of the conditions must be satisfied for
+ the rule to be satisfied.
+
+ 1. The first character must be a character with Bidi property L, R,
+ or AL. If it has the R or AL property, it is an RTL label; if it
+ has the L property, it is an LTR label.
+
+ 2. In an RTL label, only characters with the Bidi properties R, AL,
+ AN, EN, ES, CS, ET, ON, BN, or NSM are allowed.
+
+ 3. In an RTL label, the end of the label must be a character with
+ Bidi property R, AL, EN, or AN, followed by zero or more
+ characters with Bidi property NSM.
+
+ 4. In an RTL label, if an EN is present, no AN may be present, and
+ vice versa.
+
+ 5. In an LTR label, only characters with the Bidi properties L, EN,
+ ES, CS, ET, ON, BN, or NSM are allowed.
+
+ 6. In an LTR label, the end of the label must be a character with
+ Bidi property L or EN, followed by zero or more characters with
+ Bidi property NSM.
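+
+ A minimal sketch of the six conditions above, written in Python with
+ the standard unicodedata module. The name satisfies_bidi_rule is
+ hypothetical, the empty label is rejected by assumption, and the
+ result depends on the Unicode data shipped with the interpreter. The
+ sketch checks a single label and does not perform any other IDNA2008
+ validation.
+
```python
import unicodedata

RTL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LTR_ALLOWED = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}


def satisfies_bidi_rule(label):
    """Check one label against conditions 1-6 of the Bidi rule."""
    if not label:
        return False  # assumption of this sketch: empty labels fail
    classes = [unicodedata.bidirectional(ch) for ch in label]

    # Condition 1: the first character decides RTL versus LTR.
    first = classes[0]
    if first in ("R", "AL"):
        rtl = True
    elif first == "L":
        rtl = False
    else:
        return False

    # Conditions 3 and 6 look at the last character that is not a
    # trailing NSM.  The first character is never NSM here, so at
    # least one character remains.
    end = len(classes)
    while classes[end - 1] == "NSM":
        end -= 1
    last = classes[end - 1]

    if rtl:
        if not set(classes) <= RTL_ALLOWED:          # condition 2
            return False
        if last not in ("R", "AL", "EN", "AN"):      # condition 3
            return False
        if "EN" in classes and "AN" in classes:      # condition 4
            return False
        return True

    if not set(classes) <= LTR_ALLOWED:              # condition 5
        return False
    return last in ("L", "EN")                       # condition 6
```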
+
+ The following guarantees can be made based on the above:
+
+ o In a domain name consisting of only labels that satisfy the rule,
+ the requirements of Section 3 are satisfied. Note that even LTR
+ labels and pure ASCII labels have to be tested.
+
+ o In a domain name consisting of only LDH labels (as defined in the
+ Definitions document [RFC5890]) and labels that satisfy the rule,
+ the requirements of Section 3 are satisfied as long as a label
+ that starts with an ASCII digit does not come after a
+ right-to-left label.
+
+ No guarantee is given for other combinations.
+
+3. The Requirement Set for the Bidi Rule
+
+ This document, unlike RFC 3454 [RFC3454], provides an explicit
+ justification for the Bidi rule, and states a set of requirements
+ against which the modified rule can be tested.
+
+ All the text in this document assumes that text containing the labels
+ under consideration will be displayed using the Unicode bidirectional
+ algorithm [Unicode-UAX9].
+
+ The requirements proposed are these:
+
+ o Label Uniqueness: No two labels, when presented in display order
+ in the same paragraph, should have the same sequence of characters
+ without also having the same sequence of characters in network
+ order, both when the paragraph has LTR direction and when the
+ paragraph has RTL direction. (This is the criterion that is
+ explicit in RFC 3454). (Note that a label displayed in an RTL
+ paragraph may display the same as a different label displayed in
+ an LTR paragraph and still satisfy this criterion.)
+
+ o Character Grouping: When displaying a string of labels, using the
+ Unicode Bidi algorithm to reorder the characters for display, the
+ characters of each label should remain grouped between the
+ characters delimiting the labels, both when the string is embedded
+ in a paragraph with LTR direction and when it is embedded in a
+ paragraph with RTL direction.
+
+ Several stronger statements were considered and rejected, because
+ they seem to be impossible to fulfill within the constraints of the
+ Unicode bidirectional algorithm. These include:
+
+ o The appearance of a label should be unaffected by its embedding
+ context. This proved impossible even for ASCII labels; the label
+ "123-A" will have a different display order in an RTL context than
+ in an LTR context. (This particular example is, however,
+ disallowed anyway.)
+
+ o The sequence of labels should be consistent with network order.
+ This proved impossible -- a domain name consisting of the labels
+ (in network order) L1.R2.R3.L4 will be displayed as L1.R3.R2.L4 in
+ an LTR context. (In an RTL context, it will be displayed as
+ L4.R3.R2.L1).
+
+ o No two domain names should be displayed the same, even under
+ differing directionality. This was shown to be unsound, since the
+ domain name (in network order) ABC.abc will have display order
+ CBA.abc in an LTR context and abc.CBA in an RTL context, while the
+ domain name (network) abc.ABC will have display order abc.CBA in
+ an LTR context and CBA.abc in an RTL context.
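+
+ The display orders quoted in the last two items can be reproduced
+ with a toy model. The sketch below is NOT an implementation of the
+ Unicode bidirectional algorithm: it assumes the example convention of
+ Section 1.4 (uppercase ASCII letters stand for class R, lowercase for
+ class L), treats every other character as a neutral, and applies only
+ a simplified form of the neutral-resolution and reordering steps.
+ The function name display_order is hypothetical.
+
```python
def display_order(name, paragraph_rtl):
    """Toy reordering for strings of ASCII letters and neutrals only."""
    base_dir = "R" if paragraph_rtl else "L"
    n = len(name)

    # Strong direction per character; None marks a neutral such as ".".
    dirs = ["R" if c.isupper() else "L" if c.islower() else None
            for c in name]

    # Neutrals take the direction of their strong neighbours when both
    # sides agree, otherwise the paragraph direction.  String edges are
    # treated as having the paragraph direction.
    resolved = list(dirs)
    i = 0
    while i < n:
        if dirs[i] is not None:
            i += 1
            continue
        j = i
        while j < n and dirs[j] is None:
            j += 1
        before = dirs[i - 1] if i > 0 else base_dir
        after = dirs[j] if j < n else base_dir
        fill = before if before == after else base_dir
        for k in range(i, j):
            resolved[k] = fill
        i = j

    # Embedding levels: LTR paragraph -> L=0, R=1; RTL -> R=1, L=2.
    if paragraph_rtl:
        levels = [1 if d == "R" else 2 for d in resolved]
    else:
        levels = [0 if d == "L" else 1 for d in resolved]

    # From the highest level down to level 1, reverse every maximal
    # run of characters at that level or higher.
    chars = list(name)
    for lvl in range(max(levels, default=0), 0, -1):
        i = 0
        while i < n:
            if levels[i] < lvl:
                i += 1
                continue
            j = i
            while j < n and levels[j] >= lvl:
                j += 1
            chars[i:j] = reversed(chars[i:j])
            i = j
    return "".join(chars)
```
+
+ Under these assumptions, display_order("ABC.abc", False) and
+ display_order("abc.ABC", True) both yield CBA.abc, which is the
+ collision described in the last item.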
+
+ One possible requirement was thought to be problematic, but turned
+ out to be satisfied by a string that obeys the proposed rules:
+
+ o The Character Grouping requirement should be satisfied when
+ directional controls (LRE, RLE, RLO, LRO, PDF) are used in the
+ same paragraph (outside of the labels). Because these controls
+ affect presentation order in non-obvious ways, by affecting the
+ "sor" and "eor" properties of the Unicode Bidi algorithm, the
+ conditions above require extra testing in order to figure out
+ whether or not they influence the display of the domain name.
+ Testing found that for the strings allowed under the rule
+ presented in this document, directional controls do not influence
+ the display of the domain name.
+
+ This is still not stated as a requirement, since it did not seem as
+ important as the stated requirements, but it is useful to know that
+ Bidi domain names where the labels satisfy the rule have this
+ property.
+
+ In the following descriptions, first-level bullets are used to
+ indicate rules or normative statements; second-level bullets are
+ commentary.
+
+ The Character Grouping requirement can be more formally stated as:
+
+ o Let "Delimiterchars" be a set of characters with the Unicode Bidi
+ properties CS, WS, ON. (These are commonly used to delimit labels
+ -- both the FULL STOP and the space are included. They are not
+ allowed in domain labels.)
+
+ * ET, though it commonly occurs next to domain names in practice,
+ is problematic: the context R CS L EN ET (for instance A.a1%)
+ makes the label L EN not satisfy the character grouping
+ requirement.
+
+ * ES commonly occurs in labels as HYPHEN-MINUS, but could also be
+ used as a delimiter (for instance, the plus sign). It is left
+ out here.
+
+ o Let "unproblematic label" be a label that either satisfies the
+ requirements or does not contain any character with the Bidi
+ properties R, AL, or AN and does not begin with a character with
+ the Bidi property EN. (Informally, "it does not start with a
+ number".)
+
+ A label X satisfies the Character Grouping requirement when, for any
+ characters D1 and D2 taken from Delimiterchars, and for any S1 and S2
+ that are each either an unproblematic label or an empty string, the
+ following holds true:
+
+ If the string formed by concatenating S1, D1, X, D2, and S2 is
+ reordered according to the Bidi algorithm, then all the characters of
+ X in the reordered string are between D1 and D2, and no other
+ characters are between D1 and D2, both if the overall paragraph
+ direction is LTR and if the overall paragraph direction is RTL.
+
+ Note that the definition is self-referential, since S1 and S2 are
+ constrained to be "legal" by this definition. This makes testing
+ changes to proposed rules a little complex, but does not create
+ problems for testing whether or not a given proposed rule satisfies
+ the criterion.
+
+ The empty-string ("zero-length") case for S1 or S2 represents the
+ situation where a domain name is next to something that isn't a
+ domain name, separated by a delimiter character.
+
+ Note about the position of BN: The Unicode bidirectional algorithm
+ specifies that a BN has an effect on the adjoining characters in
+ network order, not in display order, and is therefore treated as if
+ removed during Bidi processing ([Unicode-UAX9], Section 3.3.2, rule
+ X9 and Section 5.3). Therefore, the question of "what position does
+ a BN have after reordering" is not meaningful. It has been ignored
+ while developing the rules here.
+
+ The Label Uniqueness requirement can be formally stated as:
+
+ If two non-identical labels X and Y, embedded as for the test above,
+ displayed in paragraphs with the same directionality, are reordered
+ by the Bidi algorithm into the same sequence of code points, the
+ labels X and Y cannot both be legal.
+
+4. Examples of Issues Found with RFC 3454
+
+4.1. Dhivehi
+
+ Dhivehi, the official language of the Maldives, is written with the
+ Thaana script. This script displays some of the characteristics of
+ the Arabic script, including its directional properties, and the
+ indication of vowels by the diacritical marking of consonantal base
+ characters. This marking is obligatory, and both two consecutive
+ vowels and syllable-final consonants are indicated with unvoiced
+ combining marks. Every Dhivehi word therefore ends with a combining
+ mark.
+
+ The word for "computer", which is romanized as "konpeetaru", is
+ written with the following sequence of Unicode code points:
+
+ U+0786 THAANA LETTER KAAFU (AL)
+
+ U+07AE THAANA OBOFILI (NSM)
+
+ U+0782 THAANA LETTER NOONU (AL)
+
+ U+07B0 THAANA SUKUN (NSM)
+
+ U+0795 THAANA LETTER PAVIYANI (AL)
+
+ U+07A9 THAANA EEBEEFILI (NSM)
+
+ U+0793 THAANA LETTER TAVIYANI (AL)
+
+ U+07A6 THAANA ABAFILI (NSM)
+
+ U+0783 THAANA LETTER RAA (AL)
+
+ U+07AA THAANA UBUFILI (NSM)
+
+ The directionality class of U+07AA in the Unicode database
+ [Unicode52] is NSM (Nonspacing Mark), which is not R or AL; a
+ conformant implementation of the IDNA2003 algorithm will say that
+ "this is not in RandALCat" and refuse to encode the string.
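+
+ The rejection can be reproduced with a short Python sketch. It
+ checks only the RFC 3454 requirement quoted in Section 1.2, and it
+ approximates RandALCat by the Bidi classes R and AL as reported by
+ the interpreter's Unicode data, whereas RFC 3454 defines RandALCat
+ through its own tables. The names below are hypothetical.
+
```python
import unicodedata

# "konpeetaru", using the code points listed above.
KONPEETARU = ("\u0786\u07ae\u0782\u07b0\u0795"
              "\u07a9\u0793\u07a6\u0783\u07aa")

RAND_AL = ("R", "AL")  # approximation of RandALCat (assumption)


def rfc3454_first_last_ok(s):
    """Requirement 3 of RFC 3454, Section 6, as quoted in Section 1.2."""
    classes = [unicodedata.bidirectional(ch) for ch in s]
    if not any(c in RAND_AL for c in classes):
        return True  # the requirement only applies to RandALCat strings
    return classes[0] in RAND_AL and classes[-1] in RAND_AL


if __name__ == "__main__":
    for ch in KONPEETARU:
        print("U+%04X" % ord(ch), unicodedata.bidirectional(ch))
    print("passes RFC 3454 requirement 3:",
          rfc3454_first_last_ok(KONPEETARU))
```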
+
+4.2. Yiddish
+
+ Yiddish is one of several languages written with the Hebrew script
+ (others include Hebrew and Ladino). This is basically a consonantal
+ alphabet (also termed an "abjad"), but Yiddish is written using an
+ extended form that is fully vocalic. The vowels are indicated in
+ several ways, one of which is by repurposing letters that are
+ consonants in Hebrew. Other letters are used both as vowels and
+ consonants, with combining marks, called "points", used to
+ differentiate between them. Finally, some base characters can
+ indicate several different vowels, which are also disambiguated by
+ combining marks. Pointed characters can appear in word-final
+ position and may therefore also be needed at the end of labels. This
+ is not an invariable attribute of a Yiddish string and there is thus
+ greater latitude here than there is with Dhivehi.
+
+ The organization now known as the "YIVO Institute for Jewish
+ Research" developed orthographic rules for modern Standard Yiddish
+ during the 1930s on the basis of work conducted in several venues
+ since earlier in that century. These are given in "The Standardized
+ Yiddish Orthography: Rules of Yiddish Spelling" [SYO], and are taken
+ as normatively descriptive of modern Standard Yiddish in any context
+ where that notion is deemed relevant. They have been applied
+ exclusively in all formal Yiddish dictionaries published since their
+ establishment, and are similarly dominant in academic and
+ bibliographic regards.
+
+ It therefore appears appropriate for this repertoire also to be
+ supported fully by IDNA. This presents no difficulty with characters
+ in initial and medial positions, but pointed characters are regularly
+ used in final position as well. All of the characters in the SYO
+ repertoire appear in both marked and unmarked form with one
+ exception: the HEBREW LETTER PE (U+05E4). The SYO only permits this
+ with a HEBREW POINT DAGESH (U+05BC), providing the Yiddish equivalent
+ to the Latin letter "p", or a HEBREW POINT RAFE (U+05BF), equivalent
+ to the Latin letter "f". There is, however, a separate unpointed
+ allograph, the HEBREW LETTER FINAL PE (U+05E3), for the latter
+ character when it appears in final position. The constraint on the
+ use of the SYO repertoire resulting from the proscription of
+ combining marks at the end of RTL strings thus reduces to nothing
+ more, or less, than the equivalent of saying that a string of Latin
+ characters cannot end with the letter "p". It must also be noted
+ that the HEBREW LETTER PE with the HEBREW POINT DAGESH is
+ characteristic of almost all traditional Yiddish orthographies that
+ predate (or remain in use in parallel to) the SYO, being the first
+ pointed character to appear in any of them.
+
+ A more general instantiation of the basic problem can be seen in the
+ representation of the YIVO acronym. This acronym is written with the
+ Hebrew letters YOD YOD HIRIQ VAV VAV ALEF QAMATS, where HIRIQ and
+ QAMATS are combining points. The distinct Unicode code points used
+ are:
+
+ U+05D9 HEBREW LETTER YOD (R)
+
+ U+05B4 HEBREW POINT HIRIQ (NSM)
+
+ U+05D5 HEBREW LETTER VAV (R)
+
+ U+05D0 HEBREW LETTER ALEF (R)
+
+ U+05B8 HEBREW POINT QAMATS (NSM)
+
+ The directionality class of U+05B8 HEBREW POINT QAMATS in the Unicode
+ database is NSM, which again causes the IDNA2003 algorithm to reject
+ the string.
+
+ It may also be noted that all of the combined characters mentioned
+ above exist in precomposed form at separate positions in the Unicode
+ chart. However, by invoking Stringprep, the IDNA2003 algorithm also
+ rejects those code points, for reasons not discussed here.
+
+4.3. Strings with Numbers
+
+ By requiring that both the first and the last character of a string
+ be a member of category R or AL, the Stringprep specification
+ [RFC3454] prohibited a string containing right-to-left characters
+ from ending with a number.
+
+ Consider the strings ALEF 5 (HEBREW LETTER ALEF + DIGIT FIVE) and 5
+ ALEF. Displayed in an LTR context, the first one will be displayed
+ from left to right as 5 ALEF (with the 5 being considered right to
+ left because of the leading ALEF), while 5 ALEF will be displayed in
+ exactly the same order (5 taking the direction from context).
+ Clearly, only one of those should be permitted as a registered label,
+ but barring them both seems unnecessary.
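+
+ As a sketch of how condition 1 of the Bidi rule in Section 2 tells
+ these two strings apart, the following Python fragment inspects only
+ the first character of each. The helper name first_character_ok is
+ hypothetical, and the fragment does not test the remaining
+ conditions.
+
```python
import unicodedata

ALEF = "\u05d0"  # HEBREW LETTER ALEF


def first_character_ok(label):
    """Condition 1 only: the first character must be L, R, or AL."""
    return unicodedata.bidirectional(label[0]) in ("L", "R", "AL")


if __name__ == "__main__":
    for label in (ALEF + "5", "5" + ALEF):
        classes = [unicodedata.bidirectional(ch) for ch in label]
        print(classes, first_character_ok(label))
```
+
+ Under that condition, ALEF 5 (classes R, EN) is accepted and 5 ALEF
+ (classes EN, R) is rejected.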
+
+5. Troublesome Situations and Guidelines
+
+ There are situations in which labels that satisfy the rule above will
+ be displayed in a surprising fashion. The most important of these is
+ the case where a label ending in a character with Bidi property AL,
+ AN, or R occurs before a label beginning with a character of Bidi
+ property EN. In that case, the number will appear to move into the
+ label containing the right-to-left character, violating the Character
+ Grouping requirement.
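+
+ An application that wants to detect this situation could use a check
+ along the following lines. This is a sketch under stated
+ assumptions: the function names are hypothetical, labels are split on
+ U+002E FULL STOP only, and trailing NSM characters are skipped when
+ determining how a label ends, which goes slightly beyond the wording
+ above.
+
```python
import unicodedata


def ends_right_to_left(label):
    """True if the label ends in R, AL, or AN (ignoring trailing NSM)."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    while classes and classes[-1] == "NSM":
        classes.pop()
    return bool(classes) and classes[-1] in ("R", "AL", "AN")


def starts_with_number(label):
    """True if the label begins with a character of Bidi class EN."""
    return bool(label) and unicodedata.bidirectional(label[0]) == "EN"


def risky_label_pairs(name):
    """Return adjacent (before, after) label pairs that may regroup."""
    labels = name.split(".")
    return [(a, b) for a, b in zip(labels, labels[1:])
            if ends_right_to_left(a) and starts_with_number(b)]
```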
+
+ If the label that occurs after the right-to-left label itself
+ satisfies the Bidi criterion, the requirements will be satisfied in
+ all cases (this is the reason why the criterion talks about strings
+ containing L in some cases). However, the IDNABIS WG concluded that
+ this could not be required for several reasons:
+
+ o There is a large current deployment of ASCII domain names starting
+ with digits. These cannot possibly be invalidated.
+
+ o Domain names are often constructed piecemeal, for instance, by
+ combining a string with the content of a search list. This may
+ occur after IDNA processing, and thus in part of the code that is
+ not IDNA-aware, making detection of the undesirable combination
+ impossible.
+
+ o Even if a label is registered under a "safe" label, there may be a
+ DNAME [RFC2672] with an "unsafe" label that points to the "safe"
+ label, thus creating seemingly valid names that would not satisfy
+ the criterion.
+
+ o Wildcards create the odd situation where a label is "valid" (can
+ be looked up successfully) without the zone owner knowing that
+ this label exists. So an owner of a zone whose name starts with a
+ digit and contains a wildcard has no way of controlling whether or
+ not names with RTL labels in them are looked up in that zone.
+
+ Rather than trying to suggest rules that disallow all such
+ undesirable situations, this document merely warns about the
+ possibility, and leaves it to application developers to take whatever
+ measures they deem appropriate to avoid problematic situations.
+
+6. Other Issues in Need of Resolution
+
+ This document concerns itself only with the rules that are needed
+ when dealing with domain names with characters that have differing
+ Bidi properties, and considers characters only in terms of their Bidi
+ properties. All other issues with scripts that are written from
+ right to left must be considered in other contexts.
+
+ One such issue is the need to keep numbers separate. Several scripts
+ are used with multiple sets of numbers -- most commonly they use
+ Latin numbers and a script-specific set of numbers, but in the case
+ of Arabic, there are two sets of "Arabic-Indic" digits involved.
+
+ The algorithm in this document disallows occurrences of AN-class
+ characters ("Arabic-Indic digits", U+0660 to U+0669) together with
+ EN-class characters (which includes "European" digits, U+0030 to
+ U+0039 and "extended Arabic-Indic digits", U+06F0 to U+06F9), but
+ does not help in preventing the mixing of, for instance, Bengali
+ digits (U+09E6 to U+09EF) and Gujarati digits (U+0AE6 to U+0AEF),
+ both of which have Bidi class L. A registry or script community that
+ wishes to create rules restricting the mixing of digits in a label
+ will be able to specify these restrictions at the registry level.
+ Some rules are also specified at the protocol level.
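+
+ A registry-level restriction of this kind might be sketched as
+ follows. The approach of grouping decimal digits by the prefix of
+ their Unicode character name is an assumption of this example, not
+ something this document specifies, and the function names are
+ hypothetical.
+
```python
import unicodedata


def digit_sets(label):
    """Return the digit families (by character-name prefix) in label."""
    families = set()
    for ch in label:
        if unicodedata.category(ch) != "Nd":
            continue  # not a decimal digit
        name = unicodedata.name(ch)
        if name.startswith("DIGIT "):
            families.add("EUROPEAN")  # U+0030 to U+0039
        else:
            # e.g. "BENGALI DIGIT ZERO" -> "BENGALI"
            families.add(name.rpartition(" DIGIT ")[0] or name)
    return families


def mixes_digit_sets(label):
    """True if digits from more than one family occur in the label."""
    return len(digit_sets(label)) > 1
```
+
+ Because Bengali and Gujarati digits both have Bidi class L, a label
+ mixing them is not caught by the Bidi rule, but mixes_digit_sets
+ flags it.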
+
+ Another set of issues concerns the proper display of IDNs with a
+ mixture of LTR and RTL labels, or only RTL labels.
+
+ It is unrealistic to expect that applications will display domain
+ names using embedded formatting codes between their labels (for one
+ thing, no reliable algorithms for identifying domain names in running
+ text exist); thus, the display order will be determined by the Bidi
+ algorithm. A sequence (in network order) of R1.R2.ltr will therefore
+ be displayed in the order 2R.1R.ltr in an LTR context, which might
+ surprise someone expecting to see labels displayed in hierarchical
+ order. People used to working with text that mixes LTR and RTL
+ strings might not be so surprised by this. Again, this memo does not
+ attempt to suggest a solution to this problem.
+
+7. Compatibility Considerations
+
+7.1. Backwards Compatibility Considerations
+
+ As with any change to an existing standard, it is important to
+ consider what happens with existing implementations when the change
+ is introduced. Some troublesome cases include:
+
+ o An old program is used to input the newly allowed label. If the old
+ program checks the input against RFC 3454, some labels will not be
+ allowed, and domain names containing those labels will remain
+ inaccessible.
+
+ o An old program is asked to display the newly allowed label, and
+ checks it against RFC 3454 before displaying. The program will
+ perform some kind of fallback, most likely displaying the label in
+ A-label form.
+
+ o An old program tries to display the newly allowed label. If the
+ old program has code for displaying the last character of a label
+ that is different from the code used to display the characters in
+ the middle of the label, the display may be inconsistent and cause
+ confusion.
+
+ One particular example of the last case is if a program chooses to
+ examine the last character (in network order) of a string in order to
+ determine its directionality, rather than its first. If it finds an
+ NSM character and tries to display the string as if it were a
+ left-to-right string, the resulting display may be interesting, but
+ not useful.
+
+ The editors believe that these cases will have a less harmful impact
+ in practice than continuing to deny the use of words from the
+ languages for which these strings are necessary as IDN labels.
+
+ This specification does not forbid using leading European digits in
+ ASCII-only labels, since this would conflict with a large installed
+ base of such labels, and would increase the scope of the
+ specification from RTL labels to all labels. The harm resulting from
+ this limitation of scope is described in Section 5. Registries and
+ private zone managers can check for this particular condition before
+ they allow registration of any RTL label. Generally, it is best to
+ disallow registration of any right-to-left strings in a zone where
+ the label at the level above begins with a digit.
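+
+ The registration check suggested above might be sketched as follows.
+ This is an illustrative assumption, not a procedure defined by this
+ memo: the function names are hypothetical, labels are assumed to be
+ in U-label form, and the zone name is assumed to be a dot-separated
+ string whose first label is the one directly above the label being
+ registered. It tests only this one condition and does not implement
+ the Bidi rule itself.
+
```python
import unicodedata


def is_rtl_label(label: str) -> bool:
    """True if the label contains a character of Bidi class R, AL, or AN."""
    return any(
        unicodedata.bidirectional(ch) in {"R", "AL", "AN"} for ch in label
    )


def parent_starts_with_digit(zone: str) -> bool:
    """True if the first label of the zone name begins with a European digit."""
    parent = zone.split(".")[0]
    return bool(parent) and unicodedata.bidirectional(parent[0]) == "EN"


def registration_allowed(label: str, zone: str) -> bool:
    """Refuse an RTL label when the label above it begins with a digit."""
    return not (is_rtl_label(label) and parent_starts_with_digit(zone))


# Hypothetical zone names, used only to exercise the check.
hebrew_label = "\u05d0\u05d1"
blocked = registration_allowed(hebrew_label, "3com.example")
ascii_ok = registration_allowed("abc", "3com.example")
rtl_ok = registration_allowed(hebrew_label, "example.com")
```
+
+ In this sketch, a left-to-right label is always accepted; only the
+ combination of an RTL label and a digit-initial parent label is
+ refused.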
+
+7.2. Forward Compatibility Considerations
+
+ This text is intentionally specified strictly in terms of the Unicode
+ Bidi properties. The determination that the condition is sufficient
+ to fulfill the criteria depends on the Unicode Bidi algorithm; it is
+ unlikely that drastic changes will be made to this algorithm.
+
+ However, the determination of validity for any string depends on the
+ Unicode Bidi property values, which are not declared immutable by the
+ Unicode Consortium. Furthermore, the behavior of the algorithm for
+ any given character is likely to be linguistically and culturally
+ sensitive, so while it should occur rarely, it is possible that later
+ versions of the Unicode Standard may change the Bidi properties
+ assigned to certain Unicode characters.
+
+ This memo does not propose a solution for this problem.
+
+8. Security Considerations
+
+ The display behavior of mixed-direction text can be extremely
+ surprising to users who are not used to it; for instance, cut and
+ paste of a piece of text can cause the text to display differently at
+ the destination, if the destination is in another directionality
+ context, and adding a character in one place of a text can cause
+ characters some distance from the point of insertion to change their
+ display position. This is, however, not a phenomenon unique to the
+ display of domain names.
+
+ The new IDNA protocol, and particularly these new Bidi rules, will
+ allow some strings to be used in IDNA contexts that are not allowed
+ today. It is possible that differences in the interpretation of
+ labels between implementations of IDNA2003 and IDNA2008 could pose a
+ security risk, but it is difficult to envision any specific
+ instantiation of this.
+
+ Any rational attempt to compute, for instance, a hash over an
+ identifier processed by IDNA would use network order for its
+ computation, and thus be unaffected by the new rules proposed here.
+
+ While it is not believed to pose a problem, if display routines had
+ been written with specific knowledge of the RFC 3454 IDNA
+ prohibitions, it is possible that the potential problems noted under
+ "Backwards Compatibility Considerations" could cause new kinds of
+ confusion.
+
+9. Acknowledgements
+
+ While the listed editors held the pen, this document represents the
+ joint work and conclusions of an ad hoc design team. In addition to
+ the editors, this consisted of, in alphabetic order, Tina Dam, Patrik
+ Faltstrom, and John Klensin. Many further specific contributions and
+ helpful comments were received from the people listed below, and
+ others who have contributed to the development and use of the IDNA
+ protocols.
+
+ The particular formulation of the Bidi rule in Section 2 was
+ suggested by Matitiahu Allouche.
+
+ The team wishes, in particular, to thank Roozbeh Pournader for
+ calling its attention to the issue with the Thaana script, Paul
+ Hoffman for pointing out the need to be explicit about backwards
+ compatibility considerations, Ken Whistler for suggesting the basis
+ of the formalized "Character Grouping" requirement, Mark Davis for
+ commentary, Erik van der Poel for careful review, comments, and
+ verification of the rulesets, Marcos Sanz, Andrew Sullivan, and Pete
+ Resnick for reviews, and Vint Cerf for chairing the working group and
+ contributing massively to getting the documents finished.
+
+10. References
+
+10.1. Normative References
+
+ [RFC5890] Klensin, J., "Internationalized Domain Names for
+ Applications (IDNA): Definitions and Document
+ Framework", RFC 5890, August 2010.
+
+ [Unicode-UAX9] The Unicode Consortium, "Unicode Standard Annex #9:
+ Unicode Bidirectional Algorithm", September 2009,
+ <http://www.unicode.org/reports/tr9/>.
+
+ [Unicode52] The Unicode Consortium. The Unicode Standard, Version
+ 5.2.0, defined by: "The Unicode Standard, Version
+ 5.2.0", (Mountain View, CA: The Unicode Consortium,
+ 2009. ISBN 978-1-936213-00-9).
+ <http://www.unicode.org/versions/Unicode5.2.0/>.
+
+10.2. Informative References
+
+ [RFC2672] Crawford, M., "Non-Terminal DNS Name Redirection",
+ RFC 2672, August 1999.
+
+ [RFC3454] Hoffman, P. and M. Blanchet, "Preparation of
+ Internationalized Strings ("stringprep")", RFC 3454,
+ December 2002.
+
+ [RFC5891] Klensin, J., "Internationalized Domain Names in
+ Applications (IDNA): Protocol", RFC 5891, August 2010.
+
+ [SYO] "The Standardized Yiddish Orthography: Rules of
+ Yiddish Spelling, 6th ed., New York, ISBN
+ 0-914512-25-0", 1999.
+
+Authors' Addresses
+
+ Harald Tveit Alvestrand (editor)
+ Google
+ Beddingen 10
+ Trondheim, 7014
+ Norway
+
+ EMail: harald@alvestrand.no
+
+
+ Cary Karp
+ Swedish Museum of Natural History
+ Frescativ. 40
+ Stockholm, 10405
+ Sweden
+
+ Phone: +46 8 5195 4055
+ Fax:
+ EMail: ck@nic.museum