doc: Add RFC documents

author: Thomas Voss <mail@thomasvoss.com> 2024-11-27 20:54:24 +0100
committer: Thomas Voss <mail@thomasvoss.com> 2024-11-27 20:54:24 +0100
commit: 4bfd864f10b68b71482b35c818559068ef8d5797 (patch)
tree: e3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc5893.txt
parent: ea76e11061bda059ae9f9ad130a9895cc85607db (diff)
1 files changed, 955 insertions, 0 deletions
diff --git a/doc/rfc/rfc5893.txt b/doc/rfc/rfc5893.txt
new file mode 100644
index 0000000..c76dfba
--- /dev/null
+++ b/doc/rfc/rfc5893.txt
@@ -0,0 +1,955 @@
+
+
+
+
+
+
+Internet Engineering Task Force (IETF)                H. Alvestrand, Ed.
+Request for Comments: 5893                                        Google
+Category: Standards Track                                        C. Karp
+ISSN: 2070-1721                        Swedish Museum of Natural History
+                                                             August 2010
+
+
+                       Right-to-Left Scripts for
+         Internationalized Domain Names for Applications (IDNA)
+
+Abstract
+
+   The use of right-to-left scripts in Internationalized Domain Names
+   (IDNs) has presented several challenges.  This memo provides a new
+   Bidi rule for Internationalized Domain Names for Applications (IDNA)
+   labels, based on the encountered problems with some scripts and some
+   shortcomings in the 2003 IDNA Bidi criterion.
+
+Status of This Memo
+
+   This is an Internet Standards Track document.
+
+   This document is a product of the Internet Engineering Task Force
+   (IETF).  It represents the consensus of the IETF community.  It has
+   received public review and has been approved for publication by the
+   Internet Engineering Steering Group (IESG).  Further information on
+   Internet Standards is available in Section 2 of RFC 5741.
+
+   Information about the current status of this document, any errata,
+   and how to provide feedback on it may be obtained at
+   http://www.rfc-editor.org/info/rfc5893.
+
+Copyright Notice
+
+   Copyright (c) 2010 IETF Trust and the persons identified as the
+   document authors.  All rights reserved.
+
+   This document is subject to BCP 78 and the IETF Trust's Legal
+   Provisions Relating to IETF Documents
+   (http://trustee.ietf.org/license-info) in effect on the date of
+   publication of this document.  Please review these documents
+   carefully, as they describe your rights and restrictions with respect
+   to this document.  Code Components extracted from this document must
+   include Simplified BSD License text as described in Section 4.e of
+   the Trust Legal Provisions and are provided without warranty as
+   described in the Simplified BSD License.
+
+
+
+
+
+Alvestrand & Karp            Standards Track                    [Page 1]
+
+RFC 5893                   IDNA Right to Left                August 2010
+
+
+Table of Contents
+
+   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  2
+     1.1.  Purpose and Applicability  . . . . . . . . . . . . . . . .  2
+     1.2.  Background and History . . . . . . . . . . . . . . . . . .  3
+     1.3.  Structure of the Rest of This Document . . . . . . . . . .  3
+     1.4.  Terminology  . . . . . . . . . . . . . . . . . . . . . . .  4
+   2.  The Bidi Rule  . . . . . . . . . . . . . . . . . . . . . . . .  6
+   3.  The Requirement Set for the Bidi Rule  . . . . . . . . . . . .  6
+   4.  Examples of Issues Found with RFC 3454 . . . . . . . . . . . .  9
+     4.1.  Dhivehi  . . . . . . . . . . . . . . . . . . . . . . . . .  9
+     4.2.  Yiddish  . . . . . . . . . . . . . . . . . . . . . . . . . 10
+     4.3.  Strings with Numbers . . . . . . . . . . . . . . . . . . . 12
+   5.  Troublesome Situations and Guidelines  . . . . . . . . . . . . 12
+   6.  Other Issues in Need of Resolution . . . . . . . . . . . . . . 13
+   7.  Compatibility Considerations . . . . . . . . . . . . . . . . . 14
+     7.1.  Backwards Compatibility Considerations . . . . . . . . . . 14
+     7.2.  Forward Compatibility Considerations . . . . . . . . . . . 15
+   8.  Security Considerations  . . . . . . . . . . . . . . . . . . . 15
+   9.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 16
+   10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 16
+     10.1. Normative References . . . . . . . . . . . . . . . . . . . 16
+     10.2. Informative References . . . . . . . . . . . . . . . . . . 17
+
+1.  Introduction
+
+1.1.  Purpose and Applicability
+
+   The purpose of this document is to establish a rule that can be
+   applied to Internationalized Domain Name (IDN) labels in Unicode form
+   (U-labels) containing characters from scripts that are written from
+   right to left.  It is part of the revised IDNA protocol [RFC5891].
+
+   When labels satisfy the rule, and when certain other conditions are
+   satisfied, there is only a minimal chance of these labels being
+   displayed in a confusing way by the Unicode bidirectional display
+   algorithm.
+
+   The other normative documents in the IDNA2008 document set establish
+   criteria for valid labels, including listing the permitted
+   characters.  This document establishes additional validity criteria
+   for labels in scripts normally written from right to left.
+
+   This specification is not intended to place any requirements on
+   domain names that do not contain characters from such scripts.
+
+
+
+
+
+
+Alvestrand & Karp            Standards Track                    [Page 2]
+
+RFC 5893                   IDNA Right to Left                August 2010
+
+
+1.2.  Background and History
+
+   The "Stringprep" specification [RFC3454], part of IDNA2003, made the
+   following statement in its Section 6 on the Bidi algorithm:
+
+      3) If a string contains any RandALCat character, a RandALCat
+      character MUST be the first character of the string, and a
+      RandALCat character MUST be the last character of the string.
+
+   (A RandALCat character is a character with unambiguously
+   right-to-left directionality.)
+
+   The reasoning behind this prohibition was to ensure that every
+   component of a displayed domain name has an unambiguously preferred
+   direction.  However, this made certain words in languages written
+   with right-to-left scripts invalid as IDN labels, and in at least one
+   case (Dhivehi) meant that all the words of an entire language were
+   forbidden as IDN labels.
+
+   This is illustrated below with examples taken from the Dhivehi and
+   Yiddish languages, as written with the Thaana and Hebrew scripts,
+   respectively.
+
+   RFC 3454 did not explicitly state the requirement to be fulfilled.
+   Therefore, it is impossible to determine whether a simple relaxation
+   of the rule would continue to fulfill the requirement.
+
+   While this document specifies rules quite different from RFC 3454,
+   most reasonable labels that were allowed under RFC 3454 will also be
+   allowed under this specification (the most important example of
+   non-permitted labels being labels that mix Arabic and European digits
+   (AN and EN) inside an RTL label, and labels that use AN in an LTR
+   label -- see Section 1.4 for terminology), so the operational impact
+   of using the new rule in the updated IDNA specification is limited.
+
+1.3.  Structure of the Rest of This Document
+
+   Section 2 defines a rule, the "Bidi rule", which can be used on a
+   domain name label to check how safe it is to use in a domain name of
+   possibly mixed directionality.  The primary initial use of this rule
+   is as part of the IDNA2008 protocol [RFC5891].
+
+   Section 3 sets out the requirements for defining the Bidi rule.
+
+   Section 4 gives detailed examples that serve as justification for the
+   new rule.
+
+
+
+
+
+Alvestrand & Karp            Standards Track                    [Page 3]
+
+RFC 5893                   IDNA Right to Left                August 2010
+
+
+   Section 5 to Section 8 describe various situations that can occur
+   when dealing with domain names with characters of different
+   directionality.
+
+   Only Section 1.4 and Section 2 are normative.
+
+1.4.  Terminology
+
+   The terminology used to describe IDNA concepts is defined in the
+   Definitions document [RFC5890].
+
+   The terminology used for the Bidi properties of Unicode characters is
+   taken from the Unicode Standard [Unicode52].
+
+   The Unicode Standard specifies a Bidi property for each character.
+   That property controls the character's behavior in the Unicode
+   bidirectional algorithm [Unicode-UAX9].  For reference, here are the
+   values that the Unicode Bidi property can have:
+
+   o  L - Left to right - most letters in LTR scripts
+
+   o  R - Right to left - most letters in non-Arabic RTL scripts
+
+   o  AL - Arabic letters - most letters in the Arabic script
+
+   o  EN - European Number (0-9, and Extended Arabic-Indic numbers)
+
+   o  ES - European Number Separator (+ and -)
+
+   o  ET - European Number Terminator (currency symbols, the hash sign,
+      the percent sign and so on)
+
+   o  AN - Arabic Number; this encompasses the Arabic-Indic numbers, but
+      not the Extended Arabic-Indic numbers
+
+   o  CS - Common Number Separator (. , / : et al)
+
+   o  NSM - Nonspacing Mark - most combining accents
+
+   o  BN - Boundary Neutral - control characters (ZWNJ, ZWJ, and others)
+
+   o  B - Paragraph Separator
+
+   o  S - Segment Separator
+
+   o  WS - Whitespace, including the SPACE character
+
+   o  ON - Other Neutrals, including @, &, parentheses, MIDDLE DOT
+
+
+
+Alvestrand & Karp            Standards Track                    [Page 4]
+
+RFC 5893                   IDNA Right to Left                August 2010
+
+
+   o  LRE, LRO, RLE, RLO, PDF - these are "directional control
+      characters" and are not used in IDNA labels.
+
+   In this memo, we use "network order" to describe the sequence of
+   characters as transmitted on the wire or stored in a file; the terms
+   "first", "next", "previous", "beginning", "end", "before", and
+   "after" are used to refer to the relationship of characters and
+   labels in network order.
+
+   We use "display order" to talk about the sequence of characters as
+   imaged on a display medium; the terms "left" and "right" are used to
+   refer to the relationship of characters and labels in display order.
+
+   Most of the time, the examples use the abbreviations for the Unicode
+   Bidi classes to denote the directionality of the characters; the
+   example string CS L consists of one character of class CS and one
+   character of class L.  In some examples, the convention that
+   uppercase characters are of class R or AL, and lowercase characters
+   are of class L is used -- thus, the example string ABC.abc would
+   consist of three right-to-left characters and three left-to-right
+   characters.
+
+   The directionality of such examples is determined by context -- for
+   instance, in the sentence "ABC.abc is displayed as CBA.abc", the
+   first example string is in network order, the second example string
+   is in display order.
+
+   The term "paragraph" is used in the sense of the Unicode Bidi
+   specification [Unicode-UAX9].  It means "a block of text that has an
+   overall direction, either left to right or right to left",
+   approximately; see the "Unicode Bidirectional Algorithm"
+   [Unicode-UAX9] for details.
+
+   "RTL" and "LTR" are abbreviations for "right to left" and "left to
+   right", respectively.
+
+   An RTL label is a label that contains at least one character of type
+   R, AL, or AN.
+
+   An LTR label is any label that is not an RTL label.
+
+   A "Bidi domain name" is a domain name that contains at least one RTL
+   label.  (Note: This definition includes domain names containing only
+   dots and right-to-left characters.  Providing a separate category of
+   "RTL domain names" would not make this specification simpler, so it
+   has not been done.)
+
+
+
+
+
+Alvestrand & Karp            Standards Track                    [Page 5]
+
+RFC 5893                   IDNA Right to Left                August 2010
+
+
+2.  The Bidi Rule
+
+   The following rule, consisting of six conditions, applies to labels
+   in Bidi domain names.  The requirements that this rule satisfies are
+   described in Section 3.  All of the conditions must be satisfied for
+   the rule to be satisfied.
+
+   1.  The first character must be a character with Bidi property L, R,
+       or AL.  If it has the R or AL property, it is an RTL label; if it
+       has the L property, it is an LTR label.
+
+   2.  In an RTL label, only characters with the Bidi properties R, AL,
+       AN, EN, ES, CS, ET, ON, BN, or NSM are allowed.
+
+   3.  In an RTL label, the end of the label must be a character with
+       Bidi property R, AL, EN, or AN, followed by zero or more
+       characters with Bidi property NSM.
+
+   4.  In an RTL label, if an EN is present, no AN may be present, and
+       vice versa.
+
+   5.  In an LTR label, only characters with the Bidi properties L, EN,
+       ES, CS, ET, ON, BN, or NSM are allowed.
+
+   6.  In an LTR label, the end of the label must be a character with
+       Bidi property L or EN, followed by zero or more characters with
+       Bidi property NSM.
+
+   The following guarantees can be made based on the above:
+
+   o  In a domain name consisting of only labels that satisfy the rule,
+      the requirements of Section 3 are satisfied.  Note that even LTR
+      labels and pure ASCII labels have to be tested.
+
+   o  In a domain name consisting of only LDH labels (as defined in the
+      Definitions document [RFC5890]) and labels that satisfy the rule,
+      the requirements of Section 3 are satisfied as long as a label
+      that starts with an ASCII digit does not come after a
+      right-to-left label.
+
+   No guarantee is given for other combinations.
+
+3.  The Requirement Set for the Bidi Rule
+
+   This document, unlike RFC 3454 [RFC3454], provides an explicit
+   justification for the Bidi rule, and states a set of requirements for
+   which it is possible to test whether or not the modified rule
+   fulfills the requirement.
+
+
+
+Alvestrand & Karp            Standards Track                    [Page 6]
+
+RFC 5893                   IDNA Right to Left                August 2010
+
+
+   All the text in this document assumes that text containing the labels
+   under consideration will be displayed using the Unicode bidirectional
+   algorithm [Unicode-UAX9].
+
+   The requirements proposed are these:
+
+   o  Label Uniqueness: No two labels, when presented in display order
+      in the same paragraph, should have the same sequence of characters
+      without also having the same sequence of characters in network
+      order, both when the paragraph has LTR direction and when the
+      paragraph has RTL direction.  (This is the criterion that is
+      explicit in RFC 3454).  (Note that a label displayed in an RTL
+      paragraph may display the same as a different label displayed in
+      an LTR paragraph and still satisfy this criterion.)
+
+   o  Character Grouping: When displaying a string of labels, using the
+      Unicode Bidi algorithm to reorder the characters for display, the
+      characters of each label should remain grouped between the
+      characters delimiting the labels, both when the string is embedded
+      in a paragraph with LTR direction and when it is embedded in a
+      paragraph with RTL direction.
+
+   Several stronger statements were considered and rejected, because
+   they seem to be impossible to fulfill within the constraints of the
+   Unicode bidirectional algorithm.  These include:
+
+   o  The appearance of a label should be unaffected by its embedding
+      context.  This proved impossible even for ASCII labels; the label
+      "123-A" will have a different display order in an RTL context than
+      in an LTR context.  (This particular example is, however,
+      disallowed anyway.)
+
+   o  The sequence of labels should be consistent with network order.
+      This proved impossible -- a domain name consisting of the labels
+      (in network order) L1.R2.R3.L4 will be displayed as L1.R3.R2.L4 in
+      an LTR context.  (In an RTL context, it will be displayed as
+      L4.R3.R2.L1).
+
+   o  No two domain names should be displayed the same, even under
+      differing directionality.  This was shown to be unsound, since the
+      domain name (in network order) ABC.abc will have display order
+      CBA.abc in an LTR context and abc.CBA in an RTL context, while the
+      domain name (network) abc.ABC will have display order abc.CBA in
+      an LTR context and CBA.abc in an RTL context.
+
+
+
+
+
+
+
+Alvestrand & Karp            Standards Track                    [Page 7]
+
+RFC 5893                   IDNA Right to Left                August 2010
+
+
+   One possible requirement was thought to be problematic, but turned
+   out to be satisfied by a string that obeys the proposed rules:
+
+   o  The Character Grouping requirement should be satisfied when
+      directional controls (LRE, RLE, RLO, LRO, PDF) are used in the
+      same paragraph (outside of the labels).  Because these controls
+      affect presentation order in non-obvious ways, by affecting the
+      "sor" and "eor" properties of the Unicode Bidi algorithm, the
+      conditions above require extra testing in order to figure out
+      whether or not they influence the display of the domain name.
+      Testing found that for the strings allowed under the rule
+      presented in this document, directional controls do not influence
+      the display of the domain name.
+
+   This is still not stated as a requirement, since it did not seem as
+   important as the stated requirements, but it is useful to know that
+   Bidi domain names where the labels satisfy the rule have this
+   property.
+
+   In the following descriptions, first-level bullets are used to
+   indicate rules or normative statements; second-level bullets are
+   commentary.
+
+   The Character Grouping requirement can be more formally stated as:
+
+   o  Let "Delimiterchars" be a set of characters with the Unicode Bidi
+      properties CS, WS, ON.  (These are commonly used to delimit labels
+      -- both the FULL STOP and the space are included.  They are not
+      allowed in domain labels.)
+
+      *  ET, though it commonly occurs next to domain names in practice,
+         is problematic: the context R CS L EN ET (for instance A.a1%)
+         makes the label L EN not satisfy the character grouping
+         requirement.
+
+      *  ES commonly occurs in labels as HYPHEN-MINUS, but could also be
+         used as a delimiter (for instance, the plus sign).  It is left
+         out here.
+
+   o  Let "unproblematic label" be a label that either satisfies the
+      requirements or does not contain any character with the Bidi
+      properties R, AL, or AN and does not begin with a character with
+      the Bidi property EN.  (Informally, "it does not start with a
+      number".)
+
+
+
+
+
+
+
+Alvestrand & Karp            Standards Track                    [Page 8]
+
+RFC 5893                   IDNA Right to Left                August 2010
+
+
+   A label X satisfies the Character Grouping requirement when, for any
+   Delimiter Character D1 and D2, and for any label S1 and S2 that is an
+   unproblematic label or an empty string, the following holds true:
+
+   If the string formed by concatenating S1, D1, X, D2, and S2 is
+   reordered according to the Bidi algorithm, then all the characters of
+   X in the reordered string are between D1 and D2, and no other
+   characters are between D1 and D2, both if the overall paragraph
+   direction is LTR and if the overall paragraph direction is RTL.
+
+   Note that the definition is self-referential, since S1 and S2 are
+   constrained to be "legal" by this definition.  This makes testing
+   changes to proposed rules a little complex, but does not create
+   problems for testing whether or not a given proposed rule satisfies
+   the criterion.
+
+   The "zero-length" case represents the case where a domain name is
+   next to something that isn't a domain name, separated by a delimiter
+   character.
+
+   Note about the position of BN: The Unicode bidirectional algorithm
+   specifies that a BN has an effect on the adjoining characters in
+   network order, not in display order, and are therefore treated as if
+   removed during Bidi processing ([Unicode-UAX9], Section 3.3.2, rule
+   X9 and Section 5.3).  Therefore, the question of "what position does
+   a BN have after reordering" is not meaningful.  It has been ignored
+   while developing the rules here.
+
+   The Label Uniqueness requirement can be formally stated as:
+
+   If two non-identical labels X and Y, embedded as for the test above,
+   displayed in paragraphs with the same directionality, are reordered
+   by the Bidi algorithm into the same sequence of code points, the
+   labels X and Y cannot both be legal.
+
+4.  Examples of Issues Found with RFC 3454
+
+4.1.  Dhivehi
+
+   Dhivehi, the official language of the Maldives, is written with the
+   Thaana script.  This script displays some of the characteristics of
+   the Arabic script, including its directional properties, and the
+   indication of vowels by the diacritical marking of consonantal base
+   characters.  This marking is obligatory, and both two consecutive
+   vowels and syllable-final consonants are indicated with unvoiced
+   combining marks.  Every Dhivehi word therefore ends with a combining
+   mark.
+
+
+
+
+Alvestrand & Karp            Standards Track                    [Page 9]
+
+RFC 5893                   IDNA Right to Left                August 2010
+
+
+   The word for "computer", which is romanized as "konpeetaru", is
+   written with the following sequence of Unicode code points:
+
+      U+0786 THAANA LETTER KAAFU (AL)
+
+      U+07AE THAANA OBOFILI (NSM)
+
+      U+0782 THAANA LETTER NOONU (AL)
+
+      U+07B0 THAANA SUKUN (NSM)
+
+      U+0795 THAANA LETTER PAVIYANI (AL)
+
+      U+07A9 THAANA LETTER EEBEEFILI (AL)
+
+      U+0793 THAANA LETTER TAVIYANI (AL)
+
+      U+07A6 THAANA ABAFILI (NSM)
+
+      U+0783 THAANA LETTER RAA (AL)
+
+      U+07AA THAANA UBUFILI (NSM)
+
+   The directionality class of U+07AA in the Unicode database
+   [Unicode52] is NSM (Nonspacing Mark), which is not R or AL; a
+   conformant implementation of the IDNA2003 algorithm will say that
+   "this is not in RandALCat" and refuse to encode the string.
+
+4.2.  Yiddish
+
+   Yiddish is one of several languages written with the Hebrew script
+   (others include Hebrew and Ladino).  This is basically a consonantal
+   alphabet (also termed an "abjad"), but Yiddish is written using an
+   extended form that is fully vocalic.  The vowels are indicated in
+   several ways, one of which is by repurposing letters that are
+   consonants in Hebrew.  Other letters are used both as vowels and
+   consonants, with combining marks, called "points", used to
+   differentiate between them.  Finally, some base characters can
+   indicate several different vowels, which are also disambiguated by
+   combining marks.  Pointed characters can appear in word-final
+   position and may therefore also be needed at the end of labels.  This
+   is not an invariable attribute of a Yiddish string and there is thus
+   greater latitude here than there is with Dhivehi.
+
+   The organization now known as the "YIVO Institute for Jewish
+   Research" developed orthographic rules for modern Standard Yiddish
+   during the 1930s on the basis of work conducted in several venues
+   since earlier in that century.  These are given in, "The Standardized
+
+
+
+Alvestrand & Karp            Standards Track                   [Page 10]
+
+RFC 5893                   IDNA Right to Left                August 2010
+
+
+   Yiddish Orthography: Rules of Yiddish Spelling" [SYO], and are taken
+   as normatively descriptive of modern Standard Yiddish in any context
+   where that notion is deemed relevant.  They have been applied
+   exclusively in all formal Yiddish dictionaries published since their
+   establishment, and are similarly dominant in academic and
+   bibliographic regards.
+
+   It therefore appears appropriate for this repertoire also to be
+   supported fully by IDNA.  This presents no difficulty with characters
+   in initial and medial positions, but pointed characters are regularly
+   used in final position as well.  All of the characters in the SYO
+   repertoire appear in both marked and unmarked form with one
+   exception: the HEBREW LETTER PE (U+05E4).  The SYO only permits this
+   with a HEBREW POINT DAGESH (U+05BC), providing the Yiddish equivalent
+   to the Latin letter "p", or a HEBREW POINT RAFE (U+05BF), equivalent
+   to the Latin letter "f".  There is, however, a separate unpointed
+   allograph, the HEBREW LETTER FINAL PE (U+05E3), for the latter
+   character when it appears in final position.  The constraint on the
+   use of the SYO repertoire resulting from the proscription of
+   combining marks at the end of RTL strings thus reduces to nothing
+   more, or less, than the equivalent of saying that a string of Latin
+   characters cannot end with the letter "p".  It must also be noted
+   that the HEBREW LETTER PE with the HEBREW POINT DAGESH is
+   characteristic of almost all traditional Yiddish orthographies that
+   predate (or remain in use in parallel to) the SYO, being the first
+   pointed character to appear in any of them.
+
+   A more general instantiation of the basic problem can be seen in the
+   representation of the YIVO acronym.  This acronym is written with the
+   Hebrew letters YOD YOD HIRIQ VAV VAV ALEF QAMATS, where HIRIQ and
+   QAMATS are combining points.  The Unicode code points are:
+
+      U+05D9 HEBREW LETTER YOD (R)
+
+      U+05B4 HEBREW POINT HIRIQ (NSM)
+
+      U+05D5 HEBREW LETTER VAV (R)
+
+      U+05D0 HEBREW LETTER ALEF (R)
+
+      U+05B8 HEBREW POINT QAMATS (NSM)
+
+   The directionality class of U+05B8 HEBREW POINT QAMATS in the Unicode
+   database is NSM, which again causes the IDNA2003 algorithm to reject
+   the string.
+
+
+
+
+
+
+Alvestrand & Karp            Standards Track                   [Page 11]
+
+RFC 5893                   IDNA Right to Left                August 2010
+
+
+   It may also be noted that all of the combined characters mentioned
+   above exist in precomposed form at separate positions in the Unicode
+   chart.  However, by invoking Stringprep, the IDNA2003 algorithm also
+   rejects those code points, for reasons not discussed here.
+
+4.3.  Strings with Numbers
+
+   By requiring that the first or last character of a string be a member
+   of category R or AL, the Stringprep specification [RFC3454]
+   prohibited a string containing right-to-left characters from ending
+   with a number.
+
+   Consider the strings ALEF 5 (HEBREW LETTER ALEF + DIGIT FIVE) and 5
+   ALEF.  Displayed in an LTR context, the first one will be displayed
+   from left to right as 5 ALEF (with the 5 being considered right to
+   left because of the leading ALEF), while 5 ALEF will be displayed in
+   exactly the same order (5 taking the direction from context).
+   Clearly, only one of those should be permitted as a registered label,
+   but barring them both seems unnecessary.
+
+5.  Troublesome Situations and Guidelines
+
+   There are situations in which labels that satisfy the rule above will
+   be displayed in a surprising fashion.  The most important of these is
+   the case where a label ending in a character with Bidi property AL,
+   AN, or R occurs before a label beginning with a character of Bidi
+   property EN.  In that case, the number will appear to move into the
+   label containing the right-to-left character, violating the Character
+   Grouping requirement.
+
+   If the label that occurs after the right-to-left label itself
+   satisfies the Bidi criterion, the requirements will be satisfied in
+   all cases (this is the reason why the criterion talks about strings
+   containing L in some cases).  However, the IDNABIS WG concluded that
+   this could not be required for several reasons:
+
+   o  There is a large current deployment of ASCII domain names starting
+      with digits.  These cannot possibly be invalidated.
+
+   o  Domain names are often constructed piecemeal, for instance, by
+      combining a string with the content of a search list.  This may
+      occur after IDNA processing, and thus in part of the code that is
+      not IDNA-aware, making detection of the undesirable combination
+      impossible.
+
+
+
+
+
+
+
+Alvestrand & Karp            Standards Track                   [Page 12]
+
+RFC 5893                   IDNA Right to Left                August 2010
+
+
+   o  Even if a label is registered under a "safe" label, there may be a
+      DNAME [RFC2672] with an "unsafe" label that points to the "safe"
+      label, thus creating seemingly valid names that would not satisfy
+      the criterion.
+
+   o  Wildcards create the odd situation where a label is "valid" (can
+      be looked up successfully) without the zone owner knowing that
+      this label exists.  So an owner of a zone whose name starts with a
+      digit and contains a wildcard has no way of controlling whether or
+      not names with RTL labels in them are looked up in his zone.
+
+   Rather than trying to suggest rules that disallow all such
+   undesirable situations, this document merely warns about the
+   possibility, and leaves it to application developers to take whatever
+   measures they deem appropriate to avoid problematic situations.
+
+6.  Other Issues in Need of Resolution
+
+   This document concerns itself only with the rules that are needed
+   when dealing with domain names with characters that have differing
+   Bidi properties, and considers characters only in terms of their Bidi
+   properties.  All other issues with scripts that are written from
+   right to left must be considered in other contexts.
+
+   One such issue is the need to keep numbers separate.  Several scripts
+   are used with multiple sets of numbers -- most commonly they use
+   Latin numbers and a script-specific set of numbers, but in the case
+   of Arabic, there are two sets of "Arabic-Indic" digits involved.
+
+   The algorithm in this document disallows occurrences of AN-class
+   characters ("Arabic-Indic digits", U+0660 to U+0669) together with
+   EN-class characters (which includes "European" digits, U+0030 to
+   U+0039 and "extended Arabic-Indic digits", U+06F0 to U+06F9), but
+   does not help in preventing the mixing of, for instance, Bengali
+   digits (U+09E6 to U+09EF) and Gujarati digits (U+0AE6 to U+0AEF),
+   both of which have Bidi class L.  A registry or script community that
+   wishes to create rules restricting the mixing of digits in a label
+   will be able to specify these restrictions at the registry level.
+   Some rules are also specified at the protocol level.
+
+   Another set of issues concerns the proper display of IDNs with a
+   mixture of LTR and RTL labels, or only RTL labels.
+
+   It is unrealistic to expect that applications will display domain
+   names using embedded formatting codes between their labels (for one
+   thing, no reliable algorithms for identifying domain names in running
+   text exist); thus, the display order will be determined by the Bidi
+   algorithm.  Thus, a sequence (in network order) of R1.R2.ltr will be
+
+
+
+Alvestrand & Karp            Standards Track                   [Page 13]
+
+RFC 5893                   IDNA Right to Left                August 2010
+
+
+   displayed in the order 2R.1R.ltr in an LTR context, which might
+   surprise someone expecting to see labels displayed in hierarchical
+   order.  People used to working with text that mixes LTR and RTL
+   strings might not be so surprised by this.  Again, this memo does not
+   attempt to suggest a solution to this problem.
+
+7.  Compatibility Considerations
+
+7.1.  Backwards Compatibility Considerations
+
+   As with any change to an existing standard, it is important to
+   consider what happens with existing implementations when the change
+   is introduced.  Some troublesome cases include:
+
+   o  An old program used to input the newly allowed label.  If the old
+      program checks the input against RFC 3454, some labels will not be
+      allowed, and domain names containing those labels will remain
+      inaccessible.
+
+   o  An old program is asked to display the newly allowed label, and
+      checks it against RFC 3454 before displaying.  The program will
+      perform some kind of fallback, most likely displaying the label in
+      A-label form.
+
+   o  An old program tries to display the newly allowed label.  If the
+      old program has code for displaying the last character of a label
+      that is different from the code used to display the characters in
+      the middle of the label, the display may be inconsistent and cause
+      confusion.
+
+   One particular example of the last case is if a program chooses to
+   examine the last character (in network order) of a string in order to
+   determine its directionality, rather than its first.  If it finds an
+   NSM character and tries to display the string as if it was a
+   left-to-right string, the resulting display may be interesting, but
+   not useful.
+
+   The editors believe that these cases will have a less harmful impact
+   in practice than continuing to deny the use of words from the
+   languages for which these strings are necessary as IDN labels.
+
+   This specification does not forbid using leading European digits in
+   ASCII-only labels, since this would conflict with a large installed
+   base of such labels, and would increase the scope of the
+   specification from RTL labels to all labels.  The harm resulting from
+   this limitation of scope is described in Section 5.  Registries and
+   private zone managers can check for this particular condition before
+   they allow registration of any RTL label.  Generally, it is best to
+
+
+
+Alvestrand & Karp            Standards Track                   [Page 14]
+
+RFC 5893                   IDNA Right to Left                August 2010
+
+
+   disallow registration of any right-to-left strings in a zone where
+   the label at the level above begins with a digit.
+
+7.2.  Forward Compatibility Considerations
+
+   This text is intentionally specified strictly in terms of the Unicode
+   Bidi properties.  The determination that the condition is sufficient
+   to fulfill the criteria depends on the Unicode Bidi algorithm; it is
+   unlikely that drastic changes will be made to this algorithm.
+
+   However, the determination of validity for any string depends on the
+   Unicode Bidi property values, which are not declared immutable by the
+   Unicode Consortium.  Furthermore, the behavior of the algorithm for
+   any given character is likely to be linguistically and culturally
+   sensitive, so while it should occur rarely, it is possible that later
+   versions of the Unicode Standard may change the Bidi properties
+   assigned to certain Unicode characters.
+
+   This memo does not propose a solution for this problem.
+
+8.  Security Considerations
+
+   The display behavior of mixed-direction text can be extremely
+   surprising to users who are not used to it; for instance, cut and
+   paste of a piece of text can cause the text to display differently at
+   the destination, if the destination is in another directionality
+   context, and adding a character in one place of a text can cause
+   characters some distance from the point of insertion to change their
+   display position.  This is, however, not a phenomenon unique to the
+   display of domain names.
+
+   The new IDNA protocol, and particularly these new Bidi rules, will
+   allow some strings to be used in IDNA contexts that are not allowed
+   today.  It is possible that differences in the interpretation of
+   labels between implementations of IDNA2003 and IDNA2008 could pose a
+   security risk, but it is difficult to envision any specific
+   instantiation of this.
+
+   Any rational attempt to compute, for instance, a hash over an
+   identifier processed by IDNA would use network order for its
+   computation, and thus be unaffected by the new rules proposed here.
+
+   While it is not believed to pose a problem, if display routines had
+   been written with specific knowledge of the RFC 3454 IDNA
+   prohibitions, it is possible that the potential problems noted under
+   "Backwards Compatibility Considerations" could cause new kinds of
+   confusion.
+
+
+
+
+Alvestrand & Karp            Standards Track                   [Page 15]
+
+RFC 5893                   IDNA Right to Left                August 2010
+
+
+9.  Acknowledgements
+
+   While the listed editors held the pen, this document represents the
+   joint work and conclusions of an ad hoc design team.  In addition to
+   the editors, this consisted of, in alphabetic order, Tina Dam, Patrik
+   Faltstrom, and John Klensin.  Many further specific contributions and
+   helpful comments were received from the people listed below, and
+   others who have contributed to the development and use of the IDNA
+   protocols.
+
+   The particular formulation of the Bidi rule in Section 2 was
+   suggested by Matitiahu Allouche.
+
+   The team wishes, in particular, to thank Roozbeh Pournader for
+   calling its attention to the issue with the Thaana script, Paul
+   Hoffman for pointing out the need to be explicit about backwards
+   compatibility considerations, Ken Whistler for suggesting the basis
+   of the formalized "Character Grouping" requirement, Mark Davis for
+   commentary, Erik van der Poel for careful review, comments, and
+   verification of the rulesets, Marcos Sanz, Andrew Sullivan, and Pete
+   Resnick for reviews, and Vint Cerf for chairing the working group and
+   contributing massively to getting the documents finished.
+
+10.  References
+
+10.1.  Normative References
+
+   [RFC5890]      Klensin, J., "Internationalized Domain Names for
+                  Applications (IDNA): Definitions and Document
+                  Framework", RFC 5890, August 2010.
+
+   [Unicode-UAX9] The Unicode Consortium, "Unicode Standard Annex #9:
+                  Unicode Bidirectional Algorithm", September 2009,
+                  <http://www.unicode.org/reports/tr9/>.
+
+   [Unicode52]    The Unicode Consortium.  The Unicode Standard, Version
+                  5.2.0, defined by: "The Unicode Standard, Version
+                  5.2.0", (Mountain View, CA: The Unicode Consortium,
+                  2009. ISBN 978-1-936213-00-9).
+                  <http://www.unicode.org/versions/Unicode5.2.0/>.
+
+
+
+
+
+
+
+
+
+
+
+Alvestrand & Karp            Standards Track                   [Page 16]
+
+RFC 5893                   IDNA Right to Left                August 2010
+
+
+10.2.  Informative References
+
+   [RFC2672]      Crawford, M., "Non-Terminal DNS Name Redirection",
+                  RFC 2672, August 1999.
+
+   [RFC3454]      Hoffman, P. and M. Blanchet, "Preparation of
+                  Internationalized Strings ("stringprep")", RFC 3454,
+                  December 2002.
+
+   [RFC5891]      Klensin, J., "Internationalized Domain Names in
+                  Applications (IDNA): Protocol", RFC 5891, August 2010.
+
+   [SYO]          "The Standardized Yiddish Orthography: Rules of
+                  Yiddish Spelling, 6th ed., New York, ISBN
+                  0-914512-25-0", 1999.
+
+Authors' Addresses
+
+   Harald Tveit Alvestrand (editor)
+   Google
+   Beddingen 10
+   Trondheim,   7014
+   Norway
+
+   EMail: harald@alvestrand.no
+
+
+   Cary Karp
+   Swedish Museum of Natural History
+   Frescativ. 40
+   Stockholm,   10405
+   Sweden
+
+   Phone: +46 8 5195 4055
+   Fax:
+   EMail: ck@nic.museum
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Alvestrand & Karp            Standards Track                   [Page 17]
+
author	Thomas Voss <mail@thomasvoss.com>	2024-11-27 20:54:24 +0100
committer	Thomas Voss <mail@thomasvoss.com>	2024-11-27 20:54:24 +0100
commit	4bfd864f10b68b71482b35c818559068ef8d5797 (patch)
tree	e3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc5893.txt
parent	ea76e11061bda059ae9f9ad130a9895cc85607db (diff)