Diffstat (limited to 'doc/rfc/rfc8228.txt')
-rw-r--r-- doc/rfc/rfc8228.txt 1347
1 files changed, 1347 insertions, 0 deletions
diff --git a/doc/rfc/rfc8228.txt b/doc/rfc/rfc8228.txt
new file mode 100644
index 0000000..714dfe5
--- /dev/null
+++ b/doc/rfc/rfc8228.txt
@@ -0,0 +1,1347 @@
+
+
+
+
+
+
+Internet Engineering Task Force (IETF)                        A. Freytag
+Request for Comments: 8228                                   August 2017
+Category: Informational
+ISSN: 2070-1721
+
+
+ Guidance on Designing Label Generation Rulesets (LGRs) Supporting
+ Variant Labels
+
+Abstract
+
+ Rules for validating identifier labels and alternate representations
+ of those labels (variants) are known as Label Generation Rulesets
+ (LGRs); they are used for the implementation of identifier systems
+ such as Internationalized Domain Names (IDNs). This document
+ describes ways to design LGRs to support variant labels. In
+ designing LGRs, it is important to ensure that the label generation
+ rules are consistent and well behaved in the presence of variants.
+ The design decisions can then be expressed using the XML
+ representation of LGRs that is defined in RFC 7940.
+
+Status of This Memo
+
+ This document is not an Internet Standards Track specification; it is
+ published for informational purposes.
+
+ This document is a product of the Internet Engineering Task Force
+ (IETF). It has been approved for publication by the Internet
+ Engineering Steering Group (IESG). Not all documents approved by the
+ IESG are a candidate for any level of Internet Standard; see
+ Section 2 of RFC 7841.
+
+ Information about the current status of this document, any errata,
+ and how to provide feedback on it may be obtained at
+ http://www.rfc-editor.org/info/rfc8228.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Freytag Informational [Page 1]
+
+RFC 8228 Variant Rules August 2017
+
+
+Copyright Notice
+
+ Copyright (c) 2017 IETF Trust and the persons identified as the
+ document authors. All rights reserved.
+
+ This document is subject to BCP 78 and the IETF Trust's Legal
+ Provisions Relating to IETF Documents
+ (http://trustee.ietf.org/license-info) in effect on the date of
+ publication of this document. Please review these documents
+ carefully, as they describe your rights and restrictions with respect
+ to this document. Code Components extracted from this document must
+ include Simplified BSD License text as described in Section 4.e of
+ the Trust Legal Provisions and are provided without warranty as
+ described in the Simplified BSD License.
+
+Table of Contents
+
+ 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3
+ 2. Variant Relations . . . . . . . . . . . . . . . . . . . . . . 4
+ 3. Symmetry and Transitivity . . . . . . . . . . . . . . . . . . 5
+ 4. A Word on Notation . . . . . . . . . . . . . . . . . . . . . 5
+ 5. Variant Mappings . . . . . . . . . . . . . . . . . . . . . . 6
+ 6. Variant Labels . . . . . . . . . . . . . . . . . . . . . . . 7
+ 7. Variant Types and Label Dispositions . . . . . . . . . . . . 7
+ 8. Allocatable Variants . . . . . . . . . . . . . . . . . . . . 8
+ 9. Blocked Variants . . . . . . . . . . . . . . . . . . . . . . 9
+ 10. Pure Variant Labels . . . . . . . . . . . . . . . . . . . . . 10
+ 11. Reflexive Variants . . . . . . . . . . . . . . . . . . . . . 11
+ 12. Limiting Allocatable Variants by Subtyping . . . . . . . . . 12
+ 13. Allowing Mixed Originals . . . . . . . . . . . . . . . . . . 14
+ 14. Handling Out-of-Repertoire Variants . . . . . . . . . . . . . 15
+ 15. Conditional Variants . . . . . . . . . . . . . . . . . . . . 16
+ 16. Making Conditional Variants Well Behaved . . . . . . . . . . 18
+ 17. Variants for Sequences . . . . . . . . . . . . . . . . . . . 19
+ 18. Corresponding XML Notation . . . . . . . . . . . . . . . . . 21
+ 19. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 22
+ 20. Security Considerations . . . . . . . . . . . . . . . . . . . 23
+ 21. References . . . . . . . . . . . . . . . . . . . . . . . . . 23
+ 21.1. Normative References . . . . . . . . . . . . . . . . . . 23
+ 21.2. Informative References . . . . . . . . . . . . . . . . . 23
+ Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . 24
+ Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 24
+
+1. Introduction
+
+ Label Generation Rulesets (LGRs) that define the set of permissible
+ labels may be applied to identifier systems that rely on labels, such
+ as the Domain Name System (DNS) [RFC1034] [RFC1035]. To date, LGRs
+ have mostly been used to define policies for implementing
+ Internationalized Domain Names (IDNs) using IDNA2008 [RFC5890]
+ [RFC5891] [RFC5892] [RFC5893] [RFC5894] in the DNS. This document
+ aims to discuss the generation of LGRs for such circumstances, but
+ the techniques and considerations here are almost certainly
+ applicable to a wider range of internationalized identifiers.
+
+ In addition to determining whether a given label is eligible, LGRs
+ may also define the condition under which alternate representations
+ of these labels, so-called "variant labels", may exist and their
+ status (disposition). In the most general sense, variant labels are
+ typically labels that are either visually or semantically
+ indistinguishable from another label in the context of the writing
+ system or script supported by the LGR. Unlike merely similar labels,
+ where there may be a measurable degree of similarity, variant labels
+ considered here represent a form of equivalence in meaning or
+ appearance. What constitutes an appropriate variant in any writing
+ system or given context, particularly in the DNS, is assumed to have
+ been determined ahead of time and therefore is not a subject of this
+ document.
+
+ Once identified, variant labels are typically delegated to some
+ entity together with the applied-for label, or permanently reserved,
+ based on the disposition derived from the LGR. Correctly defined,
+ variant labels can improve the security of an LGR, yet successfully
+ defining variant rules for an LGR so that the result is well behaved
+ is not always trivial. This document describes the basic
+ considerations and constraints that must be taken into account and
+ gives examples of what might be use cases for different types of
+ variant specifications in an LGR.
+
+ This document does not address whether variants are an appropriate
+ means to solve any given issue or the basis on which they should be
+ defined. It is intended to explain in more detail the effects of
+ various declarations and the trade-offs in making design choices. It
+ implicitly assumes that any LGR will be expressed using the XML
+ representation defined in [RFC7940] and therefore conforms to any
+ requirements stated therein. Purely for clarity of exposition,
+ examples in this document use a more compact notation than the XML
+ syntax defined in [RFC7940]. However, the reader is expected to have
+ some familiarity with the concepts described in that RFC (see
+ Section 4).
+
+ The user of any identifier system, such as the DNS, interacts with it
+ in the context of labels; variants are experienced as variant labels,
+ i.e., two (or more) labels that are functionally "same as" under the
+ conventions of the writing system used, even though their code point
+ sequences are different. An LGR specification, on the other hand,
+ defines variant mappings between code points and, only in a secondary
+ step, derives the variant labels from these mappings. For a
+ discussion of this process, see [RFC7940].
+
+ The designer of an LGR can control whether some or all of the variant
+ labels created from an original label should be allocatable, i.e.,
+ available for allocation (to the original applicant), or whether some
+ or all of these labels should be blocked instead, i.e., remain not
+ allocatable (to anyone). This document describes how this choice of
+ label disposition is accomplished (see Section 7).
+
+ The choice of desired label disposition would be based on the
+ expectations of the users of the particular zone; it is not the
+ subject of this document. Likewise, this document does not address
+ the possibility of an LGR defining custom label dispositions.
+ Instead, this document suggests ways of designing an LGR to achieve
+ the selected design choice for handling variants in the context of
+ the two standard label dispositions: "allocatable" and "blocked".
+
+ The information in this document is based on operational experience
+ gained in developing LGRs for a wide range of languages and scripts
+ using RFC 7940. This information is provided here as a benefit to
+ the wider community. It does not alter or change the specification
+ found in RFC 7940 in any way.
+
+2. Variant Relations
+
+ A variant relation is fundamentally a "same as" relation; in other
+ words, it is an equivalence relation. Now, the strictest sense of
+ "same as" would be equality, and for any equality, we have both
+ symmetry
+
+ A = B => B = A
+
+ and transitivity
+
+ A = B and B = C => A = C
+
+ The variant relation with its functional sense of "same as" must
+ really satisfy the same constraint. Once we say A is the "same as"
+ B, we also assert that B is the "same as" A. In this document, the
+ symbol "~" means "has a variant relation with". Thus, we get
+
+ A ~ B => B ~ A
+
+ Likewise, if we make the same claim for B and C (B ~ C), then we get
+ A ~ C, because if B is the "same as" both A and C, then A must be the
+ "same as" C:
+
+ A ~ B and B ~ C => A ~ C
+
+3. Symmetry and Transitivity
+
+ Not all potential relations between labels constitute equivalence,
+ and those that do not fail to be transitive, symmetric, or both.
+ For example, the degree to which labels are confusable is not
+ transitive: two labels can be confusingly similar to a third without
+ necessarily being confusable with each other, such as when the third
+ one has a shape that is "in between" the other two. In contrast, a
+ relation based on identical or effectively identical appearance would
+ meet the criterion of transitivity, and we would consider it a
+ variant relation. Examples of variant relations include other forms
+ of equivalence, such as semantic equivalence.
+
+ Using [RFC7940], a set of mappings could be defined that is neither
+ symmetric nor transitive; such a specification would be formally
+ valid. However, a symmetric and transitive set of mappings is
+ strongly preferred as a basis for an LGR, not least because of the
+ benefits from an implementation point of view; for example, if all
+ mappings are symmetric and transitive, it greatly simplifies the
+ check for collisions between labels with variants. For this reason,
+ we will limit the discussion in this document to those relations that
+ are symmetric and transitive. Incidentally, it is often
+ straightforward to verify mechanically whether an LGR is symmetric
+ and/or transitive and to compute any mappings required to make it so
+ (but see Section 15).
+
+4. A Word on Notation
+
+ [RFC7940] defines an XML schema for Label Generation Rulesets in
+ general and variant code points and sequences in particular (see
+ Section 18). That notation is rather verbose and can easily obscure
+ salient features to anyone not trained to read XML. For this reason,
+ this document uses a symbolic shorthand notation in presenting the
+ examples for discussion. This shorthand is merely a didactic tool
+ for presentation and is not intended as an alternative to or
+ replacement for the XML syntax that is used in formally specifying an
+ LGR under [RFC7940].
+
+ When it comes time to capture the LGR in a formal definition, the
+ notation used for any of the examples in this document can be
+ converted to the XML format as described in Section 18.
+
+5. Variant Mappings
+
+ So far, we have treated variant relations as simple "same as"
+ relations, ignoring that each relation representing equivalence would
+ consist of a symmetric pair of reciprocal mappings. In this
+ document, the symbol "-->" means "maps to".
+
+ A ~ B => A --> B, B --> A
+
+ In an LGR, these mappings are not defined directly between labels but
+ between code points (or code point sequences; see Section 17). In
+ the transitive case, given
+
+ A ~ B => A --> B, B --> A
+
+ A ~ C => A --> C, C --> A
+
+ we also get
+
+ B ~ C => B --> C, C --> B
+
+ for a total of six possible mappings. Conventionally, these are
+ listed in tables in order of the source code point, like so:
+
+ A --> B
+ A --> C
+ B --> A
+ B --> C
+ C --> A
+ C --> B
+
+ As we can see, A, B, and C can each be mapped two ways.
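+
+ As a minimal sketch (not part of [RFC7940]), the symmetric and
+ transitive closure used above, and the mechanical check mentioned in
+ Section 3, might be computed as follows in Python. The function
+ names and the representation of mappings as (source, target) pairs
+ are assumptions made for this illustration only.
+
```python
from itertools import product


def is_symmetric(mappings):
    """True if every mapping A --> B has a reciprocal B --> A."""
    return all((dst, src) in mappings for src, dst in mappings)


def is_transitive(mappings):
    """True if A --> B and B --> C always imply A --> C (identity excluded)."""
    return all(
        a == d or (a, d) in mappings
        for (a, b), (c, d) in product(mappings, repeat=2)
        if b == c
    )


def close(mappings):
    """Return the smallest symmetric and transitive superset of `mappings`.

    Identity mappings (A --> A) are deliberately not generated here.
    """
    closed = set(mappings)
    while True:
        # Add reciprocal mappings, then mappings implied by chaining.
        extra = {(dst, src) for src, dst in closed}
        extra |= {
            (a, d)
            for (a, b), (c, d) in product(closed, repeat=2)
            if b == c and a != d
        }
        if extra <= closed:
            return closed
        closed |= extra


partial = {("A", "B"), ("A", "C")}
full = close(partial)
```
+
+ Starting from only A --> B and A --> C, the closure contains the six
+ mappings listed above.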
+
+6. Variant Labels
+
+ To create a variant label, each code point in the original label is
+ successively replaced by all variant code points defined by a mapping
+ from the original code point. For a label AAA (the letter "A" three
+ times), the variant labels (given the mappings from the transitive
+ example above) would be
+
+ AAB
+ ABA
+ ABB
+ BAA
+ BAB
+ BBA
+ BBB
+ AAC
+ ...
+ CCC
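+
+ The permutation step can be sketched as follows, assuming the six
+ mappings of Section 5 are held in a hypothetical dictionary keyed by
+ source code point. The names used are illustrative, and the sketch
+ ignores dispositions, contexts, and code point sequences.
+
```python
from itertools import product

# Hypothetical table: source code point -> list of variant code points.
MAPPINGS = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}


def variant_labels(label, mappings):
    """Enumerate every label reachable by replacing code points with variants."""
    # At each position, the original code point or any of its variants may occur.
    choices = [[cp] + mappings.get(cp, []) for cp in label]
    labels = ("".join(combo) for combo in product(*choices))
    # The unchanged original is not counted as a variant in this sketch.
    return [candidate for candidate in labels if candidate != label]


variants = variant_labels("AAA", MAPPINGS)
```
+
+ For AAA, this yields 3 x 3 x 3 - 1 = 26 variant labels, which
+ illustrates how quickly the number of permutations grows with the
+ length of the label.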
+
+ So far, we have merely defined what the variant labels are, but we
+ have not considered their possible dispositions. In the next
+ section, we discuss how to set up the variant mappings so that some
+ variant labels are mutually exclusive (blocked), but some may be
+ allocated to the same applicant as the original label (allocatable).
+
+7. Variant Types and Label Dispositions
+
+ Assume we wanted to allow a variant relation between code points O
+ and A, and perhaps between O and B or O and C as well. Assuming
+ transitivity, this would give us:
+
+ O ~ A ~ B ~ C
+
+ Now, further assume that we would like to distinguish the case where
+ someone applies for OOO from the case where someone applies for the
+ label ABC. In the former case, we would like to allocate only the
+ applied-for label OOO, but in the latter case, we would like to also
+ allow the allocation of either the original label ABC or the variant
+ label OOO, or both, but not of any of the other possible variant
+ labels, like OAO, BCO, or the like. (A real-world example might be
+ the case where O represents an unaccented letter, while A, B, and C
+ might represent various accented forms of the same letter. Because
+ unaccented letters are a common fallback, there might be a desire to
+ allocate an unaccented label as a variant, but not the other way
+ around.)
+
+ How would we specify such a distinction?
+
+ The answer lies in labeling the mappings A --> O, B --> O, and C -->
+ O with the type "allocatable" and the mappings O --> A, O --> B, and
+ O --> C with the type "blocked". In this document, the symbol "x-->"
+ means "maps with type blocked", and the symbol "a-->" means "maps
+ with type allocatable". Thus:
+
+ O x--> A
+ O x--> B
+ O x--> C
+ A a--> O
+ B a--> O
+ C a--> O
+
+ When we generate all permutations of labels, we use mappings with
+ different types depending on which code points we start from. The
+ set of all permuted variant labels would be the same, but the
+ disposition of the variant label depends on which label we start from
+ (we call that label the "original" or "applied-for" label).
+
+ In creating an LGR with variants, all variant mappings should always
+ be labeled with a type ([RFC7940] does not formally require a type,
+ but any well-behaved LGR would be fully typed). By default, these
+ types correspond directly to the dispositions for variant labels,
+ with the most restrictive type determining the disposition of the
+ variant label. However, as we shall see later, it is sometimes
+ useful to assign types from a wider array of values than the final
+ dispositions for the labels and then define explicitly how to derive
+ label dispositions from them.
+
+8. Allocatable Variants
+
+ If we start with AAA and use the mappings from Section 7, the
+ permutation OOO will be the result of applying the mapping A a--> O
+ at each code point. That is, only mappings with type "a"
+ (allocatable) were used. To know whether we can allocate both the
+ label OOO and the original label AAA, we track the types of the
+ mappings used in generating the label.
+
+ We record the variant types for each of the variant mappings used in
+ creating the permutation in an ordered list. Such an ordered list of
+ variant types is called a "variant type list". In running text, we
+ often show it enclosed in square brackets. For example, [a x -]
+ means the variant label was derived from a variant mapping with the
+ "a" variant type in the first code point position, "x" in the second
+ code point position, and the original code point in the third
+ position ("-" means "no variant mapping").
+
+ For our example permutation, we get the following variant type list
+ (brackets dropped):
+
+ AAA --> OOO : a a a
+
+ From the variant type list, we derive a "variant type set", denoted
+ by curly braces, that contains an unordered set of unique variant
+ types in the variant type list. For the variant type list for the
+ given permutation, [a a a], the variant type set is { a }, which has
+ a single element "a".
+
+ Deciding whether to allow the allocation of a variant label then
+ amounts to deriving a disposition for the variant label from the
+ variant type set created from the variant mappings that were used to
+ create the label. For example, the derivation
+
+ if "all-variants" = "a" => set label disposition to "allocatable"
+
+ would allow OOO to be allocated, because the types of all variant
+ mappings used to create that variant label from AAA are "a".
+
+ The "all-variants" condition is tolerant of an extra "-" in the
+ variant type set (unlike the "only-variants" condition described in
+ Section 10). So, had we started with AOA, OAA, or AAO, the variant
+ type set for the permuted variant OOO would have been { a - } because
+ in each case one of the code points remains the same code point as
+ the original. The "-" means that because of the absence of a mapping
+ O --> O, there is no variant type for the O in each of these labels.
+
+ The "all-variants" = "a" condition ignores the "-", so using the
+ derivation from above, we find that OOO is an allocatable variant for
+ each of the labels AOA, OAA, or AAO.
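+
+ The bookkeeping described in this section might be sketched as
+ follows. The table TYPED holds the typed mappings of Section 7; the
+ helper names (type_list, all_variants, is_allocatable) are
+ hypothetical and are not defined by [RFC7940]. Only single code
+ point mappings of equal-length labels are handled.
+
```python
# Typed mappings from Section 7: (source, target) -> variant type.
TYPED = {
    ("O", "A"): "x", ("O", "B"): "x", ("O", "C"): "x",
    ("A", "O"): "a", ("B", "O"): "a", ("C", "O"): "a",
}


def type_list(original, variant, typed=TYPED):
    """Return the variant type list, using "-" where no mapping was applied."""
    types = []
    for src, dst in zip(original, variant):
        if (src, dst) in typed:
            types.append(typed[(src, dst)])
        elif src == dst:
            types.append("-")
        else:
            raise ValueError(f"no mapping defined for {src} --> {dst}")
    return types


def all_variants(types, allowed):
    """The "all-variants" test: every variant type used is in `allowed`.

    Positions marked "-" are ignored, as described in the text.
    """
    return {t for t in types if t != "-"} <= set(allowed)


def is_allocatable(original, variant):
    """Apply the single derivation: if "all-variants" = "a" => allocatable."""
    return all_variants(type_list(original, variant), {"a"})
```
+
+ With these definitions, OOO is found to be allocatable when
+ starting from AAA or AOA, but AAA is not allocatable when starting
+ from OOO.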
+
+ Allocatable variant labels, especially large numbers of allocatable
+ variants per label, incur a certain cost to users of the LGR. A
+ well-behaved LGR will minimize the number of allocatable variants.
+
+9. Blocked Variants
+
+ Blocked variants are not available to another registrant. They
+ therefore protect the applicant of the original label from someone
+ else registering a label that is the "same as" under some user-
+ perceived metric. Blocked variants can be a useful tool even for
+ scripts for which no allocatable variant labels are ever defined.
+
+ If we start with OOO and use the mappings from Section 7, the
+ permutation AAA will have been the result of applying only mappings
+ with type "blocked", and we cannot allocate the label AAA, only the
+ original label OOO. This corresponds to the following derivation:
+
+ if "any-variants" = "x" => set label disposition to "blocked"
+
+ Additionally, to prevent allocating ABO as a variant label for AAA,
+ we need to make sure that the mapping A --> B has been defined with
+ type "blocked", as in
+
+ A x--> B
+
+ so that
+
+ AAA --> ABO: - x a.
+
+ Thus, the set {x a} contains at least one "x" and satisfies the
+ derivation of a blocked disposition for ABO when AAA is applied for.
+
+ If an LGR results in a symmetric and transitive set of variant
+ labels, then the task of determining whether a label or its variants
+ collide with another label or its variants can be implemented very
+ efficiently. Symmetry and transitivity imply that sets of labels
+ that are mutual variants of each other are disjoint from all other
+ such sets. Only labels within the same set can be variants of each
+ other. Identifying the variant set can be an O(1) operation, and
+ enumerating all variants is not necessary.
+
+10. Pure Variant Labels
+
+ Now, if we wanted to prevent allocation of AOA when we start from
+ AAA, we would need a rule disallowing a mix of original code points
+ and variant code points; this is easily accomplished by use of the
+ "only-variants" qualifier, which requires that the label consist
+ entirely of variants and that all the variants are from the same set
+ of types.
+
+ if "only-variants" = "a" => set label disposition to "allocatable"
+
+ The two code points A in AOA are not arrived at by variant mappings,
+ because the code points are unchanged and no variant mappings are
+ defined for A --> A. So, in our example, the set of variant mapping
+ types is
+
+ AAA --> AOA: - a -
+
+ but unlike the "all-variants" condition, "only-variants" requires a
+ variant type set { a } corresponding to a variant type list [a a a]
+ (no - allowed). By adding a final derivation
+
+ else if "any-variants" = "a" => set label disposition to "blocked"
+
+ and executing that derivation only on any remaining labels, we
+ disallow AOA when starting from AAA but still allow OOO.
+
+ Derivation conditions are always applied in order, with later
+ derivations only applying to labels that did not match any earlier
+ conditions, as indicated by the use of "else" in the last example.
+ In other words, they form a cascade.
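+
+ A cascade of this kind might be sketched as follows. Placing the
+ "blocked" test of Section 9 ahead of the two derivations of this
+ section is an assumption of the sketch, as are the function names;
+ the variant type lists are the ones shown in Sections 8 to 10.
+
```python
def any_variants(types, wanted):
    """At least one position used a mapping whose type is in `wanted`."""
    return any(t in wanted for t in types)


def only_variants(types, wanted):
    """Every position used a mapping whose type is in `wanted` (no "-")."""
    return len(types) > 0 and all(t in wanted for t in types)


def disposition(types):
    """Apply the derivations in order; the first matching condition wins."""
    if any_variants(types, {"x"}):
        return "blocked"
    elif only_variants(types, {"a"}):
        return "allocatable"
    elif any_variants(types, {"a"}):
        return "blocked"
    return None  # no derivation in this sketch matched


# Variant type lists taken from the examples in Sections 8 to 10.
examples = {
    "AAA --> OOO": ["a", "a", "a"],
    "AAA --> AOA": ["-", "a", "-"],
    "AAA --> ABO": ["-", "x", "a"],
}
results = {name: disposition(types) for name, types in examples.items()}
```
+
+ Because each condition is only reached when all earlier ones have
+ failed, the order of the "if"/"elif" branches carries the same
+ meaning as the order of derivations in the LGR.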
+
+11. Reflexive Variants
+
+ But what if we started from AOA? We would expect the variant label
+ OOO to be allocatable, but, using the mappings from Section 7, the
+ variant type list would be
+
+ AOA --> OOO: a - a
+
+ because the middle O is unchanged from the original code point. Here
+ is where we use a reflexive mapping. Realizing that O is the "same
+ as" O, we can map it to itself. This is normally redundant, but
+ adding an explicit reflexive mapping allows us to specify a
+ disposition on that mapping:
+
+ O a--> O
+
+ With that, the variant type list for AOA --> OOO becomes:
+
+ AOA --> OOO: a a a
+
+ and the label OOO again passes the derivation condition
+
+ if "only-variants" = "a" => set label disposition to "allocatable"
+
+ as desired. This use of reflexive variants is typical whenever
+ derivations with the "only-variants" qualifier are used. If any code
+ point uses a reflexive variant, a well-behaved LGR would specify an
+ appropriate reflexive variant for all code points.
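+
+ The effect of the reflexive mapping can be illustrated with a short
+ sketch. The two tables and the helper names are hypothetical; for
+ brevity, the lookup treats any position without a defined mapping
+ as "-" and assumes that the variant label is otherwise reachable.
+
```python
# Mappings between A and O from Section 7, without and with O a--> O.
BASE = {("A", "O"): "a", ("O", "A"): "x"}
WITH_REFLEXIVE = {**BASE, ("O", "O"): "a"}


def type_list(original, variant, typed):
    """Variant type list; "-" marks a position with no defined mapping."""
    return [typed.get((src, dst), "-") for src, dst in zip(original, variant)]


def only_variants(types, wanted):
    """Every position used a mapping whose type is in `wanted` (no "-")."""
    return len(types) > 0 and all(t in wanted for t in types)


without = type_list("AOA", "OOO", BASE)
with_reflexive = type_list("AOA", "OOO", WITH_REFLEXIVE)
```
+
+ Only the second type list satisfies "only-variants" = "a", which is
+ why the explicit mapping O a--> O is needed.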
+
+12. Limiting Allocatable Variants by Subtyping
+
+ As we have seen, the number of variant labels can potentially be
+ large, due to combinatorics. Sometimes it is possible to divide
+ variants into categories and to stipulate that only variant labels
+ with variants from the same category should be allocatable. For some
+ LGRs, this constraint can be implemented by a rule that disallows
+ code points from different categories to occur in the same
+ allocatable label. For other LGRs, the appropriate mechanism may be
+ dividing the allocatable variants into subtypes.
+
+ To recap, in the standard case, a code point C can have (up to) two
+ types of variant mappings
+
+ C x--> X
+ C a--> A
+
+ where a--> means a variant mapping with type "allocatable" and x-->
+ means "blocked". For the purpose of the following discussion, we
+ name the target code point with the corresponding uppercase letter.
+
+ Subtyping allows us to distinguish among different types of
+ allocatable variants. For example, we can define three new types:
+ "s", "t", and "b". Of these, "s" and "t" are mutually incompatible,
+ but "b" is compatible with either "s" or "t" (in this case, "b"
+ stands for "both"). A real-world example for this might be variant
+ mappings appropriate for "simplified" or "traditional" Chinese
+ variants, or appropriate for both.
+
+ With subtypes defined as above, a code point C might have (up to)
+ four types of variant mappings
+
+ C x--> X
+ C s--> S
+ C t--> T
+ C b--> B
+
+ and explicit reflexive mappings of one of these types
+
+ C s--> C
+ C t--> C
+ C b--> C
+
+ As before, all mappings must have one and only one type, but each
+ code point may map to any number of other code points.
+
+ We define the compatibility of "b" with "t" or "s" by our choice of
+ derivation conditions as follows
+
+ if "any-variants" = "x" => blocked
+ else if "only-variants" = "s" or "b" => allocatable
+ else if "only-variants" = "t" or "b" => allocatable
+ else if "any-variants" = "s" or "t" or "b" => blocked
+
+ An original label of four code points
+
+ CCCC
+
+ may have many variant labels, such as this example listed with its
+ corresponding variant type list:
+
+ CCCC --> XSTB : x s t b
+
+ This variant label is blocked because to get from C to X required
+ x-->. (Because variant mappings are defined for specific source code
+ points, we need to show the starting label for each of these
+ examples, not merely the code points in the variant label.) The
+ variant label
+
+ CCCC --> SSBB : s s b b
+
+ is allocatable, because the variant type list contains only
+ allocatable mappings of subtype "s" or "b", which we have defined as
+ being compatible by our choice of derivations. The actual set of
+ variant types {s, b} has only two members, but the examples are
+ easier to follow if we list each type. The label
+
+ CCCC --> TTBB : t t b b
+
+ is again allocatable, because the variant type set {t, b} contains
+ only allocatable mappings of the mutually compatible allocatable
+ subtypes "t" or "b". In contrast,
+
+ CCCC --> SSTT : s s t t
+
+ is not allocatable, because the type set contains incompatible
+ subtypes "t" and "s" and thus would be blocked by the final
+ derivation.
+
+ The variant labels
+
+ CCCC --> CSBB : c s b b
+ CCCC --> CTBB : c t b b
+
+ are allocatable only for a suitable subtype of the C --> C mapping,
+ which is denoted here by "c" and (depending on what was chosen for
+ the type of the reflexive mapping) could correspond to "s", "t", or
+ "b".
+
+ If the subtype is "s", the first of these two labels is allocatable;
+ if it is "t", the second of these two labels is allocatable; if it is
+ "b", both labels are allocatable.
+
+ So far, the scheme does not seem to have brought any huge reduction
+ in allocatable variant labels, but that is because we tacitly assumed
+ that C could have all three types of allocatable variants "s", "t",
+ and "b" at the same time.
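+
+ The four derivations of this section, applied to the example type
+ lists above, might be sketched as follows. The function names are
+ hypothetical, and the type lists are written out by hand rather
+ than derived from a mapping table.
+
```python
def any_variants(types, wanted):
    """At least one position used a mapping whose type is in `wanted`."""
    return any(t in wanted for t in types)


def only_variants(types, wanted):
    """Every position used a mapping whose type is in `wanted` (no "-")."""
    return len(types) > 0 and all(t in wanted for t in types)


def disposition(types):
    """Cascade of the four derivations; the first matching one wins."""
    if any_variants(types, {"x"}):
        return "blocked"
    elif only_variants(types, {"s", "b"}):
        return "allocatable"
    elif only_variants(types, {"t", "b"}):
        return "allocatable"
    elif any_variants(types, {"s", "t", "b"}):
        return "blocked"
    return None  # nothing matched in this sketch


# Variant type lists for CCCC --> XSTB, SSBB, TTBB, and SSTT.
fixed = {name: disposition(name.split()) for name in
         ("x s t b", "s s b b", "t t b b", "s s t t")}

# CCCC --> CSBB and CCCC --> CTBB for each choice "c" of reflexive subtype.
reflexive = {
    c: (disposition([c, "s", "b", "b"]), disposition([c, "t", "b", "b"]))
    for c in ("s", "t", "b")
}
```
+
+ The results for the reflexive cases match the three outcomes
+ described above: the choice of subtype for C --> C decides which of
+ CSBB and CTBB is allocatable.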
+
+ In a real-world example, the types "s", "t", and "b" are assigned so
+ that each code point C normally has, at most, one non-reflexive
+ variant mapping labeled with one of these subtypes, and all other
+ mappings would be assigned type "x" (blocked). This holds true for
+ most code points in existing tables (such as those used in current
+ IDN Top-Level Domains (TLDs)), although certain code points have
+ exceptionally complex variant relations and may have an extra
+ mapping.
+
+13. Allowing Mixed Originals
+
+ If the desire is to allow original labels (but not variant labels)
+ that are s/t mixed, then the scheme needs to be slightly refined to
+ distinguish between reflexive and non-reflexive variants. In this
+ document, the symbol "r-n" means "a reflexive (identity) mapping of
+ type 'n'". The reflexive mappings of the preceding section thus
+ become:
+
+ C r-s--> C
+ C r-t--> C
+ C r-b--> C
+
+ With this convention, and redefining the derivations
+
+ if "any-variants" = "x" => blocked
+ else if "only-variants" = "s" or "r-s" or "b" or "r-b" => allocatable
+ else if "only-variants" = "t" or "r-t" or "b" or "r-b" => allocatable
+ else if "any-variants" = "s" or "t" or "b" => blocked
+ else => allocatable
+
+ any labels that contain only reflexive mappings of otherwise mixed
+ type (in other words, any mixed original label) now fall through, and
+ their disposition is set to "allocatable" in the final derivation.
+
+ In a well-behaved LGR, it is preferable to explicitly define the
+ derivation for allocatable labels instead of using a fall through.
+ In the derivation above, code points without any variant mappings
+ fall through and become allocatable by default if they are part of an
+ original label. Especially in a large repertoire, it can be
+ difficult to identify which code points are affected. Instead, it is
+ preferable to mark them with their own reflexive mapping type
+ "neither" or "r-n".
+
+ C r-n--> C
+
+ With that, we can change
+
+ else => allocatable
+
+ to
+
+ else if "only-variants" = "r-s" or "r-t" or "r-b" or "r-n"
+ => allocatable
+ else => invalid
+
+ This makes the intent more explicit, and by ensuring that all code
+ points in the LGR have a reflexive mapping of some kind, it is easier
+ to verify the correct assignment of their types.
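+
+ The refined cascade, including the explicit final derivations,
+ might be sketched as follows. As before, the function names are
+ hypothetical; "-" stands for a code point that has no reflexive
+ mapping at all.
+
```python
def any_variants(types, wanted):
    """At least one position used a mapping whose type is in `wanted`."""
    return any(t in wanted for t in types)


def only_variants(types, wanted):
    """Every position used a mapping whose type is in `wanted` (no "-")."""
    return len(types) > 0 and all(t in wanted for t in types)


def disposition(types):
    """Cascade from this section with the explicit final derivations."""
    if any_variants(types, {"x"}):
        return "blocked"
    elif only_variants(types, {"s", "r-s", "b", "r-b"}):
        return "allocatable"
    elif only_variants(types, {"t", "r-t", "b", "r-b"}):
        return "allocatable"
    elif any_variants(types, {"s", "t", "b"}):
        return "blocked"
    elif only_variants(types, {"r-s", "r-t", "r-b", "r-n"}):
        return "allocatable"
    return "invalid"


mixed_original = disposition(["r-s", "r-t"])    # s/t mixed original label
mixed_variant = disposition(["s", "r-t"])       # s/t mixed variant label
unmarked = disposition(["r-n", "-"])            # code point lacks a reflexive mapping
```
+
+ A label that reaches the final "invalid" branch points at a code
+ point whose reflexive mapping was forgotten, which makes such
+ omissions easier to detect.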
+
+14. Handling Out-of-Repertoire Variants
+
+ At first, it may seem counterintuitive to define variants that map to
+ code points that are not part of the repertoire. However, for zones
+ for which multiple LGRs are defined, there may be situations where
+ labels valid under one LGR should be blocked if a label under another
+ LGR is already delegated. This situation can arise whether or not
+ the repertoires of the affected LGRs overlap and, where repertoires
+ overlap, whether or not the labels are both restricted to the common
+ subset.
+
+ In order to handle this exclusion relation through definition of
+ variants, it is necessary to be able to specify variant mappings to
+ some code point X that is outside an LGR's repertoire, R:
+
+ C x--> X : where C = elementOf(R) and X != elementOf(R)
+
+ Because of symmetry, it is necessary to also specify the inverse
+ mapping in the LGR:
+
+ X x--> C : where X != elementOf(R) and C = elementOf(R)
+
+ This makes X a source of variant mappings, and it becomes necessary
+ to identify X as being outside the repertoire, so that any attempt to
+ apply for a label containing X will lead to a disposition of
+ "invalid", just as if X had never been listed in the LGR. The
+ mechanism to do this uses reflexive variants but with a new type of
+ reflexive mapping of "out-of-repertoire-var", shown as "r-o-->":
+
+ X r-o--> X
+
+ This indicates X != elementOf(R), as long as the LGR is provided with
+ a suitable derivation, so that any label containing "r-o-->" is
+ assigned a disposition of "invalid", just as if X were any other code
+ point not part of the repertoire. The derivation used is:
+
+ if "any-variant" = "out-of-repertoire-var" => invalid
+
+ It is inserted ahead of any other derivation of the "any-variant"
+ kind in the chain of derivations. As a result, instead of the
+ minimum two symmetric variants, for any out-of-repertoire variants,
+ there are a minimum of three variant mappings defined:
+
+ C x--> X
+ X x--> C
+ X r-o--> X
+
+ where C = elementOf(R) and X != elementOf(R).
+
+ Because no variant label with any code point outside the repertoire
+ could ever be allocated, the only logical choice for the non-
+ reflexive mappings to out-of-repertoire code points is "blocked".
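+
+ The three mappings and the added derivation might be sketched as
+ follows. The table and function names are hypothetical, "r-o" is
+ the shorthand used above for "out-of-repertoire-var", and the
+ sketch leaves any label that matches neither test to whatever
+ derivations follow in a complete LGR.
+
```python
# C is in the repertoire; X is not. (source, target) -> variant type.
TYPED = {("C", "X"): "x", ("X", "C"): "x", ("X", "X"): "r-o"}


def type_list(original, variant, typed=TYPED):
    """Variant type list; "-" marks a position with no defined mapping."""
    return [typed.get((src, dst), "-") for src, dst in zip(original, variant)]


def disposition(types):
    """The out-of-repertoire test comes first, ahead of the blocked test."""
    if "r-o" in types:
        return "invalid"
    elif "x" in types:
        return "blocked"
    return None  # left to later derivations


applied_for_with_x = disposition(type_list("CXC", "CXC"))
variant_with_x = disposition(type_list("CC", "CX"))
plain_original = disposition(type_list("CC", "CC"))
```
+
+ An applied-for label containing X is rejected as "invalid", while
+ a variant label that reaches X from C is merely "blocked".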
+
+15. Conditional Variants
+
+ Variant mappings are based on whether the user perceives one code
+ point as the "same as" another. In some writing systems, code points
+ change shape based on where they occur in the word (positional
+ forms). Some code points have matching shapes in some positions but
+ not in others. In such cases, the variant mapping exists only for
+ some possible positions or, more generally, only for some contexts.
+ For other contexts, the variant mapping does not exist.
+
+ For example, take two code points that have the same shape at the end
+ of a label (or in final position) but not in any other position. In
+ that case, they are variants only when they occur in the final
+ position, something we indicate like this:
+
+ final: C --> D
+
+ In cursively connected scripts, like Arabic, a code point may take
+ its final form when next to any following code point that interrupts
+ the cursive connection, not just at the end of a label. (We ignore
+ the isolated form to keep the discussion simple; if included, "final"
+ might be "final-or-isolate", for example.)
+
+ From symmetry, we expect that the mapping D --> C should also exist
+ only when the code point D is in final position. (Similar
+ considerations apply to transitivity.)
+
+ Sometimes a code point has a final form that is practically the same
+ as that of one code point while sharing its initial and medial forms
+ with a different code point:
+
+ final: C --> D
+ !final: C --> E
+
+ Here, the case where the condition is the opposite of final is shown
+ as "!final".
+
+ Because shapes differ by position, when a context is applied to a
+ variant mapping, it is treated independently from the same mapping in
+ other contexts. This extends to the assignment of types. For
+ example, the mapping C --> F may be "allocatable" in final position
+ but "blocked" in any other context:
+
+ final: C a--> F
+ !final: C x--> F
+
+ Now, the type assigned to the forward mapping is independent of the
+ reverse symmetric mapping or any transitive mappings. Imagine a
+ situation where the symmetric mapping is defined as F a--> C, that
+ is, all mappings from F to C are "allocatable":
+
+ final: F a--> C
+ !final: F a--> C
+
+ Why not simply write F a--> C? Because the forward mapping is
+ divided by context. Adding a context makes the two forward variant
+ mappings distinct, and that needs to be accounted for explicitly in
+ the reverse mappings so that human and machine readers can easily
+ verify symmetry and transitivity of the variant mappings in the LGR.
+ (This is true even though the two opposite contexts of "final" and
+ "!final" should together cover all possible cases.)
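+
+ The value of writing out both reverse mappings can be seen in a
+ simple mechanical check. The Python sketch below is only an
+ illustration: the tuple layout (context, source, target, type) and
+ the function name are assumptions of this example, with None
+ standing for a mapping without a context.
+
```python
def asymmetric_mappings(mappings):
    """Return mappings whose reverse is not defined under the same context."""
    keys = {(ctx, src, tgt) for ctx, src, tgt, _type in mappings}
    return [m for m in mappings if (m[0], m[2], m[1]) not in keys]


forward = [
    ("final", "C", "F", "allocatable"),
    ("!final", "C", "F", "blocked"),
]
qualified_reverse = [
    ("final", "F", "C", "allocatable"),
    ("!final", "F", "C", "allocatable"),
]
unqualified_reverse = [(None, "F", "C", "allocatable")]
```
+
+ A check of this kind accepts the forward mappings combined with the
+ fully qualified reverse mappings but flags every mapping when the
+ reverse is written without a context, because it compares context
+ names literally and does not reason about what the contexts cover.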
+
+16. Making Conditional Variants Well Behaved
+
+ To ensure that an LGR with contextual variants is well behaved, it is
+ best to always use "fully qualified" variant mappings that always
+ agree in the names of the context rules for forward and reverse
+ mappings. It is also necessary to ensure that no label can match
+ more than one context for the same mapping. Using mutually exclusive
+ contexts, such as "final" and "!final", is an easy way to ensure
+ that.
+
+ However, it is not always necessary to define dual or multiple
+ contexts that together cover all possible cases. For example, here
+ are two contexts that do not cover all possible positional contexts:
+
+ final: C --> D
+ initial: C --> D
+
+ A well-behaved LGR using these two contexts would define all
+ symmetric and transitive mappings involving C, D, and their variants
+ consistently in terms of the two conditions "final" and "initial" and
+ ensure that both cannot be satisfied at the same time by some label.
+
+ In addition to never defining the same mapping with two contexts that
+ may be satisfied by the same label, a well-behaved LGR never combines
+ a variant mapping with a context with the same variant mapping
+ without a context:
+
+ context: C --> D
+ C --> D
+
+ Inadvertent mixing of conditional and unconditional variants can be
+ detected and flagged by a parser, but verifying that two formally
+ distinct contexts are never satisfied by the same label would depend
+ on the interaction between labels and context rules, which means that
+ it will be up to the LGR designer to ensure that the LGR is well
+ behaved.
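+
+ The part of this check that a parser can perform is sketched below
+ in Python. The sketch is illustrative; the tuple layout (context,
+ source, target) and the function name are assumptions, with None
+ standing for a mapping without a context. It does not attempt the
+ harder problem of deciding whether two named contexts can be
+ satisfied by the same label.
+
```python
from collections import defaultdict


def mixed_conditional_mappings(mappings):
    """Return (source, target) pairs defined both with and without a context."""
    contexts = defaultdict(set)
    for ctx, src, tgt in mappings:
        contexts[(src, tgt)].add(ctx)
    return sorted(
        pair for pair, ctxs in contexts.items()
        if None in ctxs and len(ctxs) > 1
    )
```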
+
+ A well-behaved LGR never assigns conditions on a reflexive variant,
+ as that is effectively no different from having a context on the code
+ point itself; the latter is preferred.
+
+ Finally, for symmetry to work as expected, the context must be
+ defined such that it is satisfied for both the original code point in
+ the context of the original label and for the variant code point in
+ the variant label. In other words, the context should be "stable
+ under variant substitution" anywhere in the label.
+
+ Positional contexts usually satisfy this last condition; for example,
+ a code point that interrupts a cursive connection would likely share
+ this property with any of its variants. However, as it is possible
+ in principle to define other kinds of contexts, it is necessary to
+ make sure that the LGR is well behaved in this aspect at the time the
+ LGR is designed.
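+
+ For a positional context, stability under variant substitution can
+ be probed for individual labels, as in the following Python sketch.
+ It assumes a deliberately simplified "final" context (last position,
+ or followed by a code point from a set called breakers here) and a
+ dictionary of variants; all names are hypothetical, and a real
+ context rule would be defined as described in [RFC7940].
+
```python
def is_final(label, i, breakers):
    """True if position i is last or is followed by a connection breaker."""
    return i == len(label) - 1 or label[i + 1] in breakers


def unstable_substitutions(label, i, variants, breakers):
    """List single substitutions that change whether position i is final."""
    before = is_final(label, i, breakers)
    changed = []
    for j, cp in enumerate(label):
        for variant in variants.get(cp, ()):
            candidate = label[:j] + variant + label[j + 1:]
            if is_final(candidate, i, breakers) != before:
                changed.append((j, cp, variant))
    return changed
```
+
+ If N interrupts the cursive connection but its variant M does not,
+ substituting M for N changes the context of the preceding code
+ point, and the sketch reports that substitution. If both N and M
+ interrupt the connection, nothing is reported.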
+
+ Due to the difficulty in verifying these constraints mechanically, it
+ is essential that an LGR designer document the reasons why the LGR
+ can be expected to meet them and the details of the techniques used
+ to ensure that outcome. This information should be found in the
+ description element of the LGR.
+
+ In summary, conditional contexts can be useful for some cases, but
+ additional care must be taken to ensure that an LGR containing
+ conditional contexts is well behaved. LGR designers would be well
+ advised to avoid using conditional contexts and to prefer
+ unconditional rules whenever practical, even though doing so will
+ doubtless reduce the number of labels practically available.
+
+17. Variants for Sequences
+
+ Variant mappings can be defined between sequences or between a code
+ point and a sequence. For example, one might define a "blocked"
+ variant between the sequence "rn" and the code point "m" because they
+ are practically indistinguishable in common UI fonts.
+
+ Such variants are no different from variants defined between single
+ code points, except if a sequence is defined such that there is a
+ code point or shorter sequence that is a prefix (initial subsequence)
+ and both it and the remainder are also part of the repertoire. In
+ that case, it is possible to create duplicate variants with
+ conflicting dispositions.
+
+ The following shows such an example resulting in conflicting
+ dispositions:
+
+ A a--> C
+ AB x--> CD
+
+ where AB is a sequence with an initial subsequence of A. For
+ example, B might be a combining code point used in sequence AB. If B
+ only occurs in the sequence, there is no issue, but if B also occurs
+ by itself, for example:
+
+ B a--> D
+
+ then a label "AB" might correspond to either {A}{B}, that is, the two
+ code points, or {AB}, the sequence, where the curly braces show the
+ sequence boundaries as they would be applied during label validation
+ and variant mapping.
+
+ A label AB would then generate the "allocatable" variant label {C}{D}
+ and the "blocked" variant label {CD}, thus creating two variant
+ labels with conflicting dispositions.
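+
+ The conflict can be reproduced with a short Python sketch. It is a
+ simplification made for illustration: only variant labels in which
+ every segment is replaced are generated, the disposition is taken
+ to be "blocked" if any mapping used is "blocked" and "allocatable"
+ otherwise, and the names REPERTOIRE, VARIANTS, and both functions
+ are assumptions of this example.
+
```python
from itertools import product

REPERTOIRE = {"A", "B", "AB"}
VARIANTS = {
    "A": [("C", "allocatable")],
    "B": [("D", "allocatable")],
    "AB": [("CD", "blocked")],
}


def segmentations(label, repertoire):
    """Yield every way to split label into repertoire elements."""
    if not label:
        yield []
        return
    for end in range(1, len(label) + 1):
        head = label[:end]
        if head in repertoire:
            for rest in segmentations(label[end:], repertoire):
                yield [head] + rest


def full_variant_dispositions(label, repertoire, variants):
    """Map each fully substituted variant label to its derived dispositions."""
    results = {}
    for segments in segmentations(label, repertoire):
        choices = [variants.get(seg, []) for seg in segments]
        for combo in product(*choices):
            text = "".join(target for target, _ in combo)
            types = {vtype for _, vtype in combo}
            disposition = "blocked" if "blocked" in types else "allocatable"
            results.setdefault(text, set()).add(disposition)
    return results
```
+
+ For the label "AB", the two segmentations {A}{B} and {AB} both yield
+ the variant label "CD", once as "allocatable" and once as "blocked".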
+
+ For the example of a blocked variant between "m" and "rn" (and vice
+ versa), there is no issue as long as "r" and "n" do not have variant
+ mappings of their own, so that there cannot be multiple variant
+ labels for the same input. However, it is preferable to avoid
+ ambiguities altogether where possible.
+
+ The easiest way to avoid an ambiguous segmentation into sequences is
+ by never allowing both a sequence and all of its constituent parts
+ simultaneously as independent parts of the repertoire, for example,
+ by not defining B by itself as a member of the repertoire.
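+
+ A repertoire can be screened for this situation before any labels
+ are processed. The following Python sketch is one possible check,
+ with repertoire elements written as strings of letters; the
+ function names are assumptions of this example.
+
```python
def can_segment(text, repertoire):
    """True if text can be split entirely into repertoire elements."""
    if not text:
        return True
    return any(
        text[:end] in repertoire and can_segment(text[end:], repertoire)
        for end in range(1, len(text) + 1)
    )


def ambiguous_sequences(repertoire):
    """Return sequences that can also be read as shorter repertoire elements."""
    return sorted(
        seq for seq in repertoire
        if len(seq) > 1 and any(
            seq[:cut] in repertoire and can_segment(seq[cut:], repertoire)
            for cut in range(1, len(seq))
        )
    )
```
+
+ With A, B, and AB all in the repertoire, AB is reported; once B is
+ no longer listed by itself, nothing is reported.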
+
+ Sequences are often used for combining sequences that consist of a
+ base character B followed by one or more combining marks C (these
+ letters are unrelated to the B and C of the preceding example). By
+ enumerating all sequences in which a certain combining mark is
+ expected and by not listing the combining mark by itself in the LGR,
+ the mark cannot occur outside of these specifically enumerated
+ contexts. In cases where enumeration is not possible or practicable,
+ other techniques can be used to prevent ambiguous segmentation, for
+ example, a context rule on code points that disallows B preceding C
+ in any label except as part of a predefined sequence or class of
+ sequences. The details of such techniques are outside the scope of
+ this document (see [RFC7940] for information on context rules for
+ code points).
+
+18. Corresponding XML Notation
+
+ The XML format defined in [RFC7940] corresponds fairly directly to
+ the notation used for variant mappings in this document. (There is
+ no notation in that RFC for variant type sets.) In an LGR document, a
+ simple member of a repertoire that does not have any variants is
+ listed as:
+
+ <char cp="nnnn" />
+
+ where nnnn is the [UNICODE] code point value in the standard
+ uppercase hexadecimal notation padded to at least 4 digits and
+ without leading "U+". For a code point sequence of length 2, the XML
+ notation becomes:
+
+ <char cp="uuuu vvvv" />
+
+ Variant mappings are defined by nesting <var> elements inside the
+ <char> element. For example, a variant relation of type "blocked"
+
+ C x--> X
+
+ is expressed as
+
+ <char cp="nnnn">
+ <var cp="mmmm" type="blocked" />
+ </char>
+
+
+ where "x-->" identifies a "blocked" type. (Other types include
+ "a-->" for "allocatable", for example.) Here, nnnn and mmmm are the
+ [UNICODE] code point values for C and X, respectively. Either C or X
+ could be a code point sequence or a single code point.
+
+ A reflexive mapping is specified the same way, except that it always
+ uses the same code point value for both the <char> and <var> elements,
+ for example:
+
+ X r-o--> X
+
+ would correspond to
+
+ <char cp="nnnn"><var cp="nnnn" type="out-of-repertoire-var" /></char>
+
+ Multiple <var> elements may be nested inside a single <char> element,
+ but their "cp" values must be distinct (unless attributes for context
+ rules are present and the combination of "cp" value and context
+ attributes is distinct):
+
+ <char cp="nnnn">
+ <var cp="kkkk" type="allocatable" />
+ <var cp="mmmm" type="blocked" />
+ </char>
+
+ A set of conditional variants like
+
+ final: C a--> K
+ !final: C x--> K
+
+ would correspond to
+
+ <var cp="kkkk" when="final" type="allocatable" />
+ <var cp="kkkk" not-when="final" type="blocked" />
+
+ where the string "final" references a name of a context rule.
+ Context rules are defined in [RFC7940]; they conceptually correspond
+ to regular expressions. The details of how to create and define
+ these rules are outside the scope of this document. If the label
+ matches the context defined in the rule, the variant mapping is valid
+ and takes part in further processing. Otherwise, it is invalid and
+ ignored. Using the "not-when" attribute inverts the sense of the
+ match. The two attributes are mutually exclusive.
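+
+ As an illustration of this correspondence, the following Python
+ sketch builds <char> and <var> elements with the standard
+ xml.etree.ElementTree module. The helper names and the tuple layout
+ (target, type, when, not_when) are assumptions of this example, and
+ the namespace and enclosing elements required by [RFC7940] are
+ omitted.
+
```python
import xml.etree.ElementTree as ET


def cp_attr(text):
    """Format each character as uppercase hex padded to 4 digits."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


def char_element(source, variants):
    """Build a <char> element; variants are (target, type, when, not_when)."""
    char = ET.Element("char", {"cp": cp_attr(source)})
    for target, vtype, when, not_when in variants:
        if when and not_when:
            raise ValueError('"when" and "not-when" are mutually exclusive')
        attrs = {"cp": cp_attr(target)}
        if when:
            attrs["when"] = when
        if not_when:
            attrs["not-when"] = not_when
        attrs["type"] = vtype
        ET.SubElement(char, "var", attrs)
    return char


# final: C a--> K and !final: C x--> K, using "c" and "k" as stand-ins.
example = char_element("c", [
    ("k", "allocatable", "final", None),
    ("k", "blocked", None, "final"),
])
xml_text = ET.tostring(example, encoding="unicode")
```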
+
+ A derivation of a variant label disposition
+
+ if "only-variants" = "s" or "b" => allocatable
+
+ is expressed as
+
+ <action disp="allocatable" only-variants="s b" />
+
+ Instead of using "if" and "else if", the <action> elements implicitly
+ form a cascade, where the first action triggered defines the
+ disposition of the label. The order of action elements is thus
+ significant.
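+
+ The cascade can be pictured with the following Python sketch. It
+ reflects a simplified reading of the matching conditions and is not
+ a substitute for [RFC7940]: the tuple layout, the function name,
+ and the only_variants_label flag (true when every code point in the
+ label results from a variant mapping) are assumptions of this
+ example.
+
```python
def first_matching_disposition(actions, variant_types, only_variants_label):
    """Return the disposition of the first triggered action, or None.

    actions: ordered list of (disposition, condition, type_set).
    variant_types: set of variant types used to generate the label.
    """
    for disposition, condition, type_set in actions:
        if condition == "any-variant":
            triggered = bool(variant_types & type_set)
        elif condition == "all-variants":
            triggered = bool(variant_types) and variant_types <= type_set
        elif condition == "only-variants":
            triggered = (only_variants_label and bool(variant_types)
                         and variant_types <= type_set)
        else:
            raise ValueError(f"unknown condition: {condition}")
        if triggered:
            return disposition  # first triggered action wins
    return None


ACTIONS = [
    ("invalid", "any-variant", {"out-of-repertoire-var"}),
    ("blocked", "any-variant", {"blocked"}),
    ("allocatable", "only-variants", {"s", "b"}),
]
```
+
+ Because the first triggered action wins, a label whose variant
+ types include both "out-of-repertoire-var" and "blocked" is
+ "invalid" under this ordering; reversing the first two actions
+ would make it "blocked" instead.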
+
+ For the full specification of the XML format, see [RFC7940].
+
+19. IANA Considerations
+
+ This document does not require any IANA actions.
+
+20. Security Considerations
+
+ As described in [RFC7940], variants may be used as a tool to reduce
+ certain avenues of attack in security-relevant identifiers by
+ allowing certain labels to be "mutually exclusive or registered only
+ to the same user". However, if indiscriminately designed, variants
+ may themselves contribute to risks to the security or usability of
+ the identifiers, whether resulting from an ambiguous definition or
+ from allowing too many allocatable variants per label.
+
+ The information in this document is intended to allow the reader to
+ design a specification of an LGR that is "well behaved" with respect
+ to variants; as used here, this term refers to an LGR that is
+ predictable in its effects to the LGR author (and reviewer) and more
+ reliable in its implementation.
+
+ A well-behaved LGR is not merely one that can be expressed in
+ [RFC7940], but, in addition, it actively avoids certain edge cases
+ not prevented by the schema, such as those that would result in
+ ambiguities in the specification of the intended disposition for some
+ variant labels. By applying the additional considerations introduced
+ in this document, including adding certain declarations that are
+ optional under the schema and may not alter the results of processing
+ a label, such an LGR becomes easier to review and its implementations
+ easier to verify.
+
+ It should be noted that variants are an important part, but only a
+ part, of an LGR design. There are many other features of an LGR that
+ this document does not touch upon. Also, the question of whether to
+ define variants at all, or what labels are to be considered variants
+ of each other, is not addressed here.
+
+21. References
+
+21.1. Normative References
+
+ [RFC7940] Davies, K. and A. Freytag, "Representing Label Generation
+ Rulesets Using XML", RFC 7940, DOI 10.17487/RFC7940,
+ August 2016, <https://www.rfc-editor.org/info/rfc7940>.
+
+21.2. Informative References
+
+ [RFC1034] Mockapetris, P., "Domain names - concepts and facilities",
+ STD 13, RFC 1034, DOI 10.17487/RFC1034, November 1987,
+ <https://www.rfc-editor.org/info/rfc1034>.
+
+ [RFC1035] Mockapetris, P., "Domain names - implementation and
+ specification", STD 13, RFC 1035, DOI 10.17487/RFC1035,
+ November 1987, <https://www.rfc-editor.org/info/rfc1035>.
+
+ [RFC5890] Klensin, J., "Internationalized Domain Names for
+ Applications (IDNA): Definitions and Document Framework",
+ RFC 5890, DOI 10.17487/RFC5890, August 2010,
+ <https://www.rfc-editor.org/info/rfc5890>.
+
+ [RFC5891] Klensin, J., "Internationalized Domain Names in
+ Applications (IDNA): Protocol", RFC 5891,
+ DOI 10.17487/RFC5891, August 2010,
+ <https://www.rfc-editor.org/info/rfc5891>.
+
+ [RFC5892] Faltstrom, P., Ed., "The Unicode Code Points and
+ Internationalized Domain Names for Applications (IDNA)",
+ RFC 5892, DOI 10.17487/RFC5892, August 2010,
+ <https://www.rfc-editor.org/info/rfc5892>.
+
+ [RFC5893] Alvestrand, H., Ed. and C. Karp, "Right-to-Left Scripts
+ for Internationalized Domain Names for Applications
+ (IDNA)", RFC 5893, DOI 10.17487/RFC5893, August 2010,
+ <https://www.rfc-editor.org/info/rfc5893>.
+
+ [RFC5894] Klensin, J., "Internationalized Domain Names for
+ Applications (IDNA): Background, Explanation, and
+ Rationale", RFC 5894, DOI 10.17487/RFC5894, August 2010,
+ <https://www.rfc-editor.org/info/rfc5894>.
+
+ [UNICODE] The Unicode Consortium, "The Unicode Standard",
+ <http://www.unicode.org/versions/latest/>.
+
+Acknowledgments
+
+ Contributions that have shaped this document have been provided by
+ Marc Blanchet, Ben Campbell, Patrik Faltstrom, Scott Hollenbeck,
+ Mirja Kuehlewind, Sarmad Hussain, John Klensin, Alexey Melnikov,
+ Nicholas Ostler, Michel Suignard, Andrew Sullivan, Wil Tan, and
+ Suzanne Woolf.
+
+Author's Address
+
+ Asmus Freytag
+
+ Email: asmus@unicode.org
+