author | Thomas Voss <mail@thomasvoss.com> | 2024-11-27 20:54:24 +0100 |
---|---|---|
committer | Thomas Voss <mail@thomasvoss.com> | 2024-11-27 20:54:24 +0100 |
commit | 4bfd864f10b68b71482b35c818559068ef8d5797 (patch) | |
tree | e3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc6365.txt | |
parent | ea76e11061bda059ae9f9ad130a9895cc85607db (diff) |
doc: Add RFC documents
Diffstat (limited to 'doc/rfc/rfc6365.txt')
-rw-r--r-- | doc/rfc/rfc6365.txt | 2635 |
1 files changed, 2635 insertions, 0 deletions
diff --git a/doc/rfc/rfc6365.txt b/doc/rfc/rfc6365.txt new file mode 100644 index 0000000..e0cfa2d --- /dev/null +++ b/doc/rfc/rfc6365.txt @@ -0,0 +1,2635 @@ + + + + + + +Internet Engineering Task Force (IETF) P. Hoffman +Request for Comments: 6365 VPN Consortium +BCP: 166 J. Klensin +Obsoletes: 3536 September 2011 +Category: Best Current Practice +ISSN: 2070-1721 + + + Terminology Used in Internationalization in the IETF + +Abstract + + This document provides a list of terms used in the IETF when + discussing internationalization. The purpose is to help frame + discussions of internationalization in the various areas of the IETF + and to help introduce the main concepts to IETF participants. + +Status of This Memo + + This memo documents an Internet Best Current Practice. + + This document is a product of the Internet Engineering Task Force + (IETF). It represents the consensus of the IETF community. It has + received public review and has been approved for publication by the + Internet Engineering Steering Group (IESG). Further information on + BCPs is available in Section 2 of RFC 5741. + + Information about the current status of this document, any errata, + and how to provide feedback on it may be obtained at + http://www.rfc-editor.org/info/rfc6365. + +Copyright Notice + + Copyright (c) 2011 IETF Trust and the persons identified as the + document authors. All rights reserved. + + This document is subject to BCP 78 and the IETF Trust's Legal + Provisions Relating to IETF Documents + (http://trustee.ietf.org/license-info) in effect on the date of + publication of this document. Please review these documents + carefully, as they describe your rights and restrictions with respect + to this document. Code Components extracted from this document must + include Simplified BSD License text as described in Section 4.e of + the Trust Legal Provisions and are provided without warranty as + described in the Simplified BSD License. 
+ + + + + + +Hoffman & Klensin Best Current Practice [Page 1] + +RFC 6365 Internationalization Terminology September 2011 + + +Table of Contents + + 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 + 1.1. Purpose of this Document . . . . . . . . . . . . . . . . . 3 + 1.2. Format of the Definitions in This Document . . . . . . . . 4 + 1.3. Normative Terminology . . . . . . . . . . . . . . . . . . 4 + 2. Fundamental Terms . . . . . . . . . . . . . . . . . . . . . . 5 + 3. Standards Bodies and Standards . . . . . . . . . . . . . . . . 10 + 3.1. Standards Bodies . . . . . . . . . . . . . . . . . . . . . 11 + 3.2. Encodings and Transformation Formats of ISO/IEC 10646 . . 13 + 3.3. Native CCSs and Charsets . . . . . . . . . . . . . . . . . 15 + 4. Character Issues . . . . . . . . . . . . . . . . . . . . . . . 16 + 4.1. Types of Characters . . . . . . . . . . . . . . . . . . . 20 + 4.2. Differentiation of Subsets . . . . . . . . . . . . . . . . 23 + 5. User Interface for Text . . . . . . . . . . . . . . . . . . . 24 + 6. Text in Current IETF Protocols . . . . . . . . . . . . . . . . 27 + 7. Terms Associated with Internationalized Domain Names . . . . . 31 + 7.1. IDNA Terminology . . . . . . . . . . . . . . . . . . . . . 31 + 7.2. Character Relationships and Variants . . . . . . . . . . . 32 + 8. Other Common Terms in Internationalization . . . . . . . . . . 33 + 9. Security Considerations . . . . . . . . . . . . . . . . . . . 36 + 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 37 + 10.1. Normative References . . . . . . . . . . . . . . . . . . . 37 + 10.2. Informative References . . . . . . . . . . . . . . . . . . 37 + Appendix A. Additional Interesting Reading . . . . . . . . . . . 41 + Appendix B. Acknowledgements . . . . . . . . . . . . . . . . . . 42 + Appendix C. Significant Changes from RFC 3536 . . . . . . . . . . 42 + Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
43 + + + + + + + + + + + + + + + + + + + + + + + +Hoffman & Klensin Best Current Practice [Page 2] + +RFC 6365 Internationalization Terminology September 2011 + + +1. Introduction + + As the IETF Character Set Policy specification [RFC2277] summarizes: + "Internationalization is for humans. This means that protocols are + not subject to internationalization; text strings are." Many + protocols throughout the IETF use text strings that are entered by, + or are visible to, humans. Subject only to the limitations of their + own knowledge and facilities, it should be possible for anyone to + enter or read these text strings, which means that Internet users + must be able to enter text using typical input methods and have it be + displayed in any human language. Further, text containing any + character should be able to be passed between Internet applications + easily. This is the challenge of internationalization. + +1.1. Purpose of this Document + + This document provides a glossary of terms used in the IETF when + discussing internationalization. The purpose is to help frame + discussions of internationalization in the various areas of the IETF + and to help introduce the main concepts to IETF participants. + + Internationalization is discussed in many working groups of the IETF. + However, few working groups have internationalization experts. When + designing or updating protocols, the question often comes up "Should + we internationalize this?" (or, more likely, "Do we have to + internationalize this?"). + + This document gives an overview of internationalization terminology + as it applies to IETF standards work by lightly covering the many + aspects of internationalization and the vocabulary associated with + those topics. Some of the overview is somewhat tutorial in nature. + It is not meant to be a complete description of internationalization. + The definitions here SHOULD be used by IETF standards. 
IETF + standards that explicitly want to create different definitions for + the terms defined here can do so, but unless an alternate definition + is provided the definitions of the terms in this document apply. + IETF standards that have a requirement for different definitions are + encouraged, for clarity's sake, to find terms different than the ones + defined here. Some of the definitions in this document come from + earlier IETF documents and books. + + As in many fields, there is disagreement in the internationalization + community on definitions for many words. The topic of language + brings up particularly passionate opinions for experts and non- + experts alike. This document attempts to define terms in a way that + will be most useful to the IETF audience. + + + + + +Hoffman & Klensin Best Current Practice [Page 3] + +RFC 6365 Internationalization Terminology September 2011 + + + This document uses definitions from many documents that have been + developed inside and outside the IETF. The primary documents used + are: + + o ISO/IEC 10646 [ISOIEC10646] + + o The Unicode Standard [UNICODE] + + o W3C Character Model [CHARMOD] + + o IETF RFCs, including the Character Set Policy specification + [RFC2277] and the domain name internationalization standard + [RFC5890] + +1.2. Format of the Definitions in This Document + + In the body of this document, the source for the definition is shown + in angle brackets, such as "<ISOIEC10646>". Many definitions are + shown as "<RFC6365>", which means that the definitions were crafted + originally for this document. The angle bracket notation for the + source of definitions is different than the square bracket notation + used for references to documents, such as in the paragraph above; + these references are given in the reference sections of this + document. + + For some terms, there are commentary and examples after the + definitions. 
In those cases, the part before the angle brackets is + the definition that comes from the original source, and the part + after the angle brackets is commentary that is not a definition (such + as an example or further exposition). + + Examples in this document use the notation for code points and names + from the Unicode Standard [UNICODE] and ISO/IEC 10646 [ISOIEC10646]. + For example, the letter "a" may be represented as either "U+0061" or + "LATIN SMALL LETTER A". See RFC 5137 [RFC5137] for a description of + this notation. + +1.3. Normative Terminology + + The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", + "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this + document are to be interpreted as described in RFC 2119 [RFC2119]. + + + + + + + + + +Hoffman & Klensin Best Current Practice [Page 4] + +RFC 6365 Internationalization Terminology September 2011 + + +2. Fundamental Terms + + This section covers basic topics that are needed for almost anyone + who is involved with making IETF protocols more friendly to non-ASCII + text (see Section 4.2) and with other aspects of + internationalization. + + language + + A language is a way that humans communicate. The use of language + occurs in many forms, the most common of which are speech, + writing, and signing. <RFC6365> + + Some languages have a close relationship between the written and + spoken forms, while others have a looser relationship. The so- + called LTRU (Language Tag Registry Update) standards [RFC5646] + [RFC4647] discuss languages in more detail and provide identifiers + for languages for use in Internet protocols. Note that computer + languages are explicitly excluded from this definition. + + script + + A set of graphic characters used for the written form of one or + more languages. 
<ISOIEC10646> + + Examples of scripts are Latin, Cyrillic, Greek, Arabic, and Han + (the characters, often called ideographs after a subset of them, + used in writing Chinese, Japanese, and Korean). RFC 2277 + discusses scripts in detail. + + It is common for internationalization novices to mix up the terms + "language" and "script". This can be a problem in protocols that + differentiate the two. Almost all protocols that are designed (or + were re-designed) to handle non-ASCII text deal with scripts (the + written systems) or characters, while fewer actually deal with + languages. + + A single name can mean either a language or a script; for example, + "Arabic" is both the name of a language and the name of a script. + In fact, many scripts borrow their names from the names of + languages. Further, many scripts are used to write more than one + language; for example, the Russian and Bulgarian languages are + written in the Cyrillic script. Some languages can be expressed + using different scripts or were used with different scripts at + different times; the Mongolian language can be written in either + the Mongolian or Cyrillic scripts; Malay is primarily written in + Latin script today, but the earlier, Arabic-script-based, Jawa + form is still in use; and a number of languages were converted + + + +Hoffman & Klensin Best Current Practice [Page 5] + +RFC 6365 Internationalization Terminology September 2011 + + + from other scripts to Cyrillic in the first half of the last + century, some of which have switched again more recently. + Further, some languages are normally expressed with more than one + script at the same time; for example, the Japanese language is + normally expressed in the Kanji (Han), Katakana, and Hiragana + scripts in a single string of text. + + writing system + + A set of rules for using one or more scripts to write a particular + language. 
Examples include the American English writing system, + the British English writing system, the French writing system, and + the Japanese writing system. <UNICODE> + + character + + A member of a set of elements used for the organization, control, + or representation of data. <ISOIEC10646> + + There are at least three common definitions of the word + "character": + + * a general description of a text entity + + * a unit of a writing system, often synonymous with "letter" or + similar terms, but generalized to include digits and symbols of + various sorts + + * the encoded entity itself + + + When people talk about characters, they usually intend one of the + first two definitions. The term "character" is often abbreviated + as "char". + + A particular character is identified by its name, not by its + shape. A name may suggest a meaning, but the character may be + used for representing other meanings as well. A name may suggest + a shape, but that does not imply that only that shape is commonly + used in print, nor that the particular shape is associated only + with that name. + + coded character + + A character together with its coded representation. <ISOIEC10646> + + + + + + +Hoffman & Klensin Best Current Practice [Page 6] + +RFC 6365 Internationalization Terminology September 2011 + + + coded character set + + A coded character set (CCS) is a set of unambiguous rules that + establishes a character set and the relationship between the + characters of the set and their coded representation. + <ISOIEC10646> + + character encoding form + + A character encoding form is a mapping from a coded character set + (CCS) to the actual code units used to represent the data. + <UNICODE> + + repertoire + + The collection of characters included in a character set. Also + called a character repertoire. <UNICODE> + + glyph + + A glyph is an image of a character that can be displayed after + being imaged onto a display surface. 
<RFC6365> + + The Unicode Standard has a different definition that refers to an + abstract form that may represent different images when the same + character is rendered under different circumstances. + + glyph code + + A glyph code is a numeric code that refers to a glyph. Usually, + the glyphs contained in a font are referenced by their glyph code. + Glyph codes are local to a particular font; that is, a different + font containing the same glyphs may use different codes. <UNICODE> + + transcoding + + Transcoding is the process of converting text data from one + character encoding form to another. Transcoders work only at the + level of character encoding and do not parse the text. Note: + Transcoding may involve one-to-one, many-to-one, one-to-many, or + many-to-many mappings. Because some legacy mappings are glyphic, + they may not only be many-to-many, but also unordered: thus XYZ + may map to yxz. <CHARMOD> + + In this definition, "many-to-one" means a sequence of characters + mapped to a single character. The "many" does not mean + alternative characters that map to the single character. + + + + +Hoffman & Klensin Best Current Practice [Page 7] + +RFC 6365 Internationalization Terminology September 2011 + + + character encoding scheme + + A character encoding scheme (CES) is a character encoding form + plus byte serialization. There are many character encoding + schemes in Unicode, such as UTF-8 and UTF-16BE. <UNICODE> + + Some CESs are associated with a single CCS; for example, UTF-8 + [RFC3629] applies only to the identical CCSs of ISO/IEC 10646 and + Unicode. Other CESs, such as ISO 2022, are associated with many + CCSs. + + charset + + A charset is a method of mapping a sequence of octets to a + sequence of abstract characters. A charset is, in effect, a + combination of one or more CCSs with a CES. Charset names are + registered by the IANA according to procedures documented in + [RFC2978]. 
<RFC6365> + + Many protocol definitions use the term "character set" in their + descriptions. The terms "charset", or "character encoding scheme" + and "coded character set", are strongly preferred over the term + "character set" because "character set" has other definitions in + other contexts, particularly outside the IETF. When reading IETF + standards that use "character set" without defining the term, they + usually mean "a specific combination of one CCS with a CES", + particularly when they are talking about the "US-ASCII character + set". + + internationalization + + In the IETF, "internationalization" means to add or improve the + handling of non-ASCII text in a protocol. <RFC6365> A different + perspective, more appropriate to protocols that are designed for + global use from the beginning, is the definition used by W3C: + + "Internationalization is the design and development of a + product, application or document content that enables easy + localization for target audiences that vary in culture, region, + or language." [W3C-i18n-Def] + + Many protocols that handle text only handle one charset + (US-ASCII), or leave the question of what CCS and encoding are + used up to local guesswork (which leads, of course, to + interoperability problems). If multiple charsets are permitted, + they must be explicitly identified [RFC2277]. Adding non-ASCII + text to a protocol allows the protocol to handle more scripts, + hopefully all of the ones useful in the world. In today's world, + + + +Hoffman & Klensin Best Current Practice [Page 8] + +RFC 6365 Internationalization Terminology September 2011 + + + that is normally best accomplished by allowing Unicode encoded in + UTF-8 only, thereby shifting conversion issues away from + individual choices. + + localization + + The process of adapting an internationalized application platform + or application to a specific cultural environment. 
In + localization, the same semantics are preserved while the syntax + may be changed. [FRAMEWORK] + + Localization is the act of tailoring an application for a + different language or script or culture. Some internationalized + applications can handle a wide variety of languages. Typical + users only understand a small number of languages, so the program + must be tailored to interact with users in just the languages they + know. + + The major work of localization is translating the user interface + and documentation. Localization involves not only changing the + language interaction, but also other relevant changes such as + display of numbers, dates, currency, and so on. The better + internationalized an application is, the easier it is to localize + it for a particular language and character encoding scheme. + + Localization is rarely an IETF matter, and protocols that are + merely localized, even if they are serially localized for several + locations, are generally considered unsatisfactory for the global + Internet. + + Do not confuse "localization" with "locale", which is described in + Section 8 of this document. + + i18n, l10n + + These are abbreviations for "internationalization" and + "localization". <RFC6365> + + "18" is the number of characters between the "i" and the "n" in + "internationalization", and "10" is the number of characters + between the "l" and the "n" in "localization". + + + + + + + + + + +Hoffman & Klensin Best Current Practice [Page 9] + +RFC 6365 Internationalization Terminology September 2011 + + + multilingual + + The term "multilingual" has many widely varying definitions and + thus is not recommended for use in standards. Some of the + definitions relate to the ability to handle international + characters; other definitions relate to the ability to handle + multiple charsets; and still others relate to the ability to + handle multiple languages. 
<RFC6365> + + displaying and rendering text + + To display text, a system puts characters on a visual display + device such as a screen or a printer. To render text, a system + analyzes the character input to determine how to display the text. + The terms "display" and "render" are sometimes used + interchangeably. Note, however, that text might be rendered as + audio and/or tactile output, such as in systems that have been + designed for people with visual disabilities. <RFC6365> + + Combining characters modify the display of the character (or, in + some cases, characters) that precede them. When rendering such + text, the display engine must either find the glyph in the font + that represents the base character and all of the combining + characters, or it must render the combination itself. Such + rendering can be straightforward, but it is sometimes complicated + when the combining marks interact with each other, such as when + there are two combining marks that would appear above the same + character. Formatting characters can also change the way that a + renderer would display text. Rendering can also be difficult for + some scripts that have complex display rules for base characters, + such as Arabic and Indic scripts. + +3. Standards Bodies and Standards + + This section describes some of the standards bodies and standards + that appear in discussions of internationalization in the IETF. This + is an incomplete and possibly over-full list; listing too few bodies + or standards can be just as politically dangerous as listing too + many. Note that there are many other bodies that deal with + internationalization; however, few if any of them appear commonly in + IETF standards work. + + + + + + + + + + +Hoffman & Klensin Best Current Practice [Page 10] + +RFC 6365 Internationalization Terminology September 2011 + + +3.1. 
Standards Bodies + + ISO and ISO/IEC JTC 1 + + The International Organization for Standardization has been + involved with standards for characters since before the IETF was + started. ISO is a non-governmental group made up of national + bodies. Most of ISO's work in information technology is performed + jointly with a similar body, the International Electrotechnical + Commission (IEC) through a joint committee known as "JTC 1". ISO + and ISO/IEC JTC 1 have many diverse standards in the international + characters area; the one that is most used in the IETF is commonly + referred to as "ISO/IEC 10646", sometimes with a specific date. + ISO/IEC 10646 describes a CCS that covers almost all known written + characters in use today. + + ISO/IEC 10646 is controlled by the group known as "ISO/IEC JTC 1/ + SC 2 WG2", often called "SC2/WG2" or "WG2" for short. ISO + standards go through many steps before being finished, and years + often go by between changes to the base ISO/IEC 10646 standard + although amendments are now issued to track Unicode changes. + Information on WG2, and its work products, can be found at + <http://www.dkuug.dk/JTC1/SC2/WG2/>. Information on SC2, and its + work products, can be found at <http://www.iso.org/iso/ + standards_development/technical_committees/ + list_of_iso_technical_committees/ + iso_technical_committee.htm?commid=45050> + + The standard comes as a base part and a series of attachments or + amendments. It is available in PDF form for downloading or in a + CD-ROM version. One example of how to cite the standard is given + in [RFC3629]. Any standard that cites ISO/IEC 10646 needs to + evaluate how to handle the versioning problem that is relevant to + the protocol's needs. + + ISO is responsible for other standards that might be of interest + to protocol developers concerned about internationalization. + ISO 639 [ISO639] specifies the names of languages and forms part + of the basis for the IETF's Language Tag work [RFC5646]. 
ISO 3166 + [ISO3166] specifies the names and code abbreviations for countries + and territories and is used in several protocols and databases + including names for country-code top level domain names. The + responsibilities of ISO TC 46 on Information and Documentation + <http://www.iso.org/iso/standards_development/ + technical_committees/list_of_iso_technical_committees/ + iso_technical_committee.htm?commid=48750> include a series of + standards for transliteration of various languages into Latin + characters. + + + +Hoffman & Klensin Best Current Practice [Page 11] + +RFC 6365 Internationalization Terminology September 2011 + + + Another relevant ISO group was JTC 1/SC22/WG20, which was + responsible for internationalization in JTC 1, such as for + international string ordering. Information on WG20, and its work + products, can be found at <http://www.dkuug.dk/jtc1/sc22/wg20/>. + The specific tasks of SC22/WG20 were moved from SC22 into SC2, and + there has been little significant activity since that occurred. + + Unicode Consortium + + The second important group for international character standards + is the Unicode Consortium. The Unicode Consortium is a trade + association of companies, governments, and other groups interested + in promoting the Unicode Standard [UNICODE]. The Unicode Standard + is a CCS whose repertoire and code points are identical to + ISO/IEC 10646. The Unicode Consortium has added features to the + base CCS that make it more useful in protocols, such as defining + attributes for each character. Examples of these attributes + include case conversion and numeric properties. + + The actual technical and definitional work of the Unicode + Consortium is done in the Unicode Technical Committee (UTC). The + terms "UTC" and "Unicode Consortium" are often treated, + imprecisely, as synonymous in the IETF. + + The Unicode Consortium publishes addenda to the Unicode Standard + as Unicode Technical Reports. 
There are many types of technical + reports at various stages of maturity. The Unicode Standard and + affiliated technical reports can be found at + <http://www.unicode.org/>. + + A reciprocal agreement between the Unicode Consortium and + ISO/IEC JTC 1/SC 2 provides for ISO/IEC 10646 and The Unicode + Standard to track each other for definitions of characters and + assignments of code points. Updates, often in the form of + amendments, to the former sometimes lag updates to the latter for + a short period, but the gap has rarely been significant in recent + years. + + At the time that the IETF character set policy [RFC2277] was + established and the first version of this terminology + specification was published, there was a strong preference in the + IETF community for references to ISO/IEC 10646 (rather than + Unicode) when possible. That preference largely reflected a more + general IETF preference for referencing established open + international standards over specifications from consortia. + However, the Unicode definitions of character properties and + classes are not part of ISO/IEC 10646. Because IETF + specifications are increasingly dependent on those definitions + + + +Hoffman & Klensin Best Current Practice [Page 12] + +RFC 6365 Internationalization Terminology September 2011 + + + (for example, see the explanation in Section 4.2) and the Unicode + specifications are freely available online in convenient machine- + readable form, the IETF's preference has shifted to referencing + the Unicode Standard. The latter is especially important when + version consistency between code points (either standard) and + Unicode properties (Unicode only) is required. + + World Wide Web Consortium (W3C) + + This group created and maintains the standard for XML, the markup + language for text that has become very popular. XML has always + been fully internationalized so that there is no need for a new + version to handle international text. 
However, in some + circumstances, XML files may be sensitive to differences among + Unicode versions. + + local and regional standards organizations + + Just as there are many native CCSs and charsets, there are many + local and regional standards organizations to create and support + them. Common examples of these are ANSI (United States), CEN/ISSS + (Europe), JIS (Japan), and SAC (China). + +3.2. Encodings and Transformation Formats of ISO/IEC 10646 + + Characters in the ISO/IEC 10646 CCS can be expressed in many ways. + Historically, "encoding forms" are both direct addressing methods, + while "transformation formats" are methods for expressing encoding + forms as bits on the wire. That distinction has mostly disappeared + in recent years. + + Documents that discuss characters in the ISO/IEC 10646 CCS often need + to list specific characters. RFC 5137 describes the common methods + for doing so in IETF documents, and these practices have been adopted + by many other communities as well. + + Basic Multilingual Plane (BMP) + + The BMP is composed of the first 2^16 code points in ISO/IEC 10646 + and contains almost all characters in contemporary use. The BMP + is also called "Plane 0". + + UCS-2 and UCS-4 + + UCS-2 and UCS-4 are the two encoding forms historically defined + for ISO/IEC 10646. UCS-2 addresses only the BMP. Because many + useful characters (such as many Han characters) have been defined + outside of the BMP, many people consider UCS-2 to be obsolete. + + + +Hoffman & Klensin Best Current Practice [Page 13] + +RFC 6365 Internationalization Terminology September 2011 + + + UCS-4 addresses the entire range of code points from ISO/IEC 10646 + (by agreement between ISO/IEC JTC 1 SC2 and the Unicode + Consortium, a range from 0..0x10FFFF) as 32-bit values with zero + padding to the left. UCS-4 is identical to UTF-32BE (without use + of a BOM (see below)); UTF-32BE is now the preferred term. 
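The UCS-2 and UCS-4 entry above can be illustrated with a minimal sketch. It assumes Python 3 and its standard "utf-32-be" codec; the helper names in_bmp and ucs4_octets are hypothetical and do not come from any standard. The sketch packs a code point as a 32-bit big-endian value with zero padding to the left and checks that the result matches UTF-32BE without a BOM.

```python
import struct

MAX_CODE_POINT = 0x10FFFF  # upper bound of the agreed code point range


def in_bmp(code_point: int) -> bool:
    """Return True if the code point is one of the first 2^16 (Plane 0)."""
    return 0 <= code_point < 2 ** 16


def ucs4_octets(code_point: int) -> bytes:
    """Pack a code point as a 32-bit big-endian value, zero-padded on the left.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("code point outside the range 0..0x10FFFF")
    return struct.pack(">I", code_point)


latin_a = "a"            # U+0061, inside the BMP
han = "\U00020000"       # U+20000, a Han character outside the BMP

for ch in (latin_a, han):
    octets = ucs4_octets(ord(ch))
    # The hand-packed value should equal Python's UTF-32BE encoding (no BOM).
    assert octets == ch.encode("utf-32-be")
    print(f"U+{ord(ch):04X} in BMP: {in_bmp(ord(ch))}, octets: {octets.hex()}")
```

A character such as U+20000 lies beyond what UCS-2 can address, which is the reason the entry gives for treating UCS-2 as obsolete.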
+ + UTF-8 + + UTF-8 [RFC3629] is the preferred encoding for IETF protocols. + Characters in the BMP are encoded as one, two, or three octets. + Characters outside the BMP are encoded as four octets. Characters + from the US-ASCII repertoire have the same on-the-wire + representation in UTF-8 as they do in US-ASCII. The IETF-specific + definition of UTF-8 in RFC 3629 is identical to that in recent + versions of the Unicode Standard (e.g., in Section 3.9 of Version + 6.0 [UNICODE]). + + UTF-16, UTF-16BE, and UTF-16LE + + UTF-16, UTF-16BE, and UTF-16LE, three transformation formats + described in [RFC2781] and defined in The Unicode Standard + (Sections 3.9 and 16.8 of Version 6.0), are not required by any + IETF standards, and are thus used much less often in protocols + than UTF-8. Characters in the BMP are always encoded as two + octets, and characters outside the BMP are encoded as four octets + using a "surrogate pair" arrangement. The latter is not part of + UCS-2, marking the difference between UTF-16 and UCS-2. The three + UTF-16 formats differ based on the order of the octets and the + presence or absence of a special lead-in ordering identifier + called the "byte order mark" or "BOM". + + UTF-32 + + The Unicode Consortium and ISO/IEC JTC 1 have defined UTF-32 as a + transformation format that incorporates the integer code point + value right-justified in a 32-bit field. As with UTF-16, the byte + order mark (BOM) can be used and UTF-32BE and UTF-32LE are + defined. UTF-32 and UCS-4 are essentially equivalent and the + terms are often used interchangeably. + + SCSU and BOCU-1 + + The Unicode Consortium has defined an encoding, SCSU [UTR6], which + is designed to offer good compression for typical text. A + different encoding that is meant to be MIME-friendly, BOCU-1, is + described in [UTN6]. Although compression is attractive, as + opposed to UTF-8, neither of these (at the time of this writing) + has attracted much interest. 
+ + + +Hoffman & Klensin Best Current Practice [Page 14] + +RFC 6365 Internationalization Terminology September 2011 + + + The compression provided as a side effect of the Punycode + algorithm [RFC3492] is heavily used in some contexts, especially + IDNA [RFC5890], but imposes some restrictions. (See also + Section 7.) + +3.3. Native CCSs and Charsets + + Before ISO/IEC 10646 was developed, many countries developed their + own CCSs and charsets. Some of these were adopted into international + standards for the relevant scripts or writing systems. Many dozen of + these are in common use on the Internet today. Examples include + ISO 8859-5 for Cyrillic and Shift-JIS for Japanese scripts. + + The official list of the registered charset names for use with IETF + protocols is maintained by IANA and can be found at + <http://www.iana.org/assignments/character-sets>. The list contains + preferred names and aliases. Note that this list has historically + contained many errors, such as names that are in fact not charsets or + references that do not give enough detail to reliably map names to + charsets. + + Probably the most well-known native CCS is ASCII [US-ASCII]. This + CCS is used as the basis for keywords and parameter names in many + IETF protocols, and as the sole CCS in numerous IETF protocols that + have not yet been internationalized. ASCII became the basis for + ISO/IEC 646 which, in turn, formed the basis for many national and + international standards, such as the ISO 8859 series, that mix Basic + Latin characters with characters from another script. + + It is important to note that, strictly speaking, "ASCII" is a CCS and + repertoire, not an encoding. The encoding used for ASCII in IETF + protocols involves the 7-bit integer ASCII code point right-justified + in an 8-bit field and is sometimes described as the "Network Virtual + Terminal" or "NVT" encoding [RFC5198]. Less formally, "ASCII" and + "NVT" are often used interchangeably. 
However, "non-ASCII" refers + only to characters outside the ASCII repertoire and is not linked to + a specific encoding. See Section 4.2. + + A Unicode publication describes issues involved in mapping character + data between charsets, and an XML format for mapping table data + [UTR22]. + + + + + + + + + + +Hoffman & Klensin Best Current Practice [Page 15] + +RFC 6365 Internationalization Terminology September 2011 + + +4. Character Issues + + This section contains terms and topics that are commonly used in + character handling and therefore are of concern to people adding non- + ASCII text handling to protocols. These topics are standardized + outside the IETF. + + code point + + A value in the codespace of a repertoire. For all common + repertoires developed in recent years, code point values are + integers (code points for ASCII and its immediate descendants were + defined in terms of column and row positions of a table). + + combining character + + A member of an identified subset of the coded character set of + ISO/IEC 10646 intended for combination with the preceding non- + combining graphic character, or with a sequence of combining + characters preceded by a non-combining character. Combining + characters are inherently non-spacing. <ISOIEC10646> + + composite sequence or combining character sequence + + A sequence of graphic characters consisting of a non-combining + character followed by one or more combining characters. A graphic + symbol for a composite sequence generally consists of the + combination of the graphic symbols of each character in the + sequence. The Unicode Standard often uses the term "combining + character sequence" to refer to composite sequences. A composite + sequence is not a character and therefore is not a member of the + repertoire of ISO/IEC 10646. <ISOIEC10646> However, Unicode now + assigns names to some such sequences especially when the names are + required to match terminology in other standards [UAX34]. 
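As a small illustration of the combining character and composite sequence entries above, the following Python sketch builds the sequence "a" plus combining acute and compares it with the single precomposed code point. The constant and function names are hypothetical. The test on the Unicode General Category ("M" for marks) is only an approximation, assumed here, of the ISO/IEC 10646 notion of a combining character.

```python
import unicodedata

BASE = "a"                # U+0061, a non-combining graphic character
ACUTE = "\u0301"          # U+0301 COMBINING ACUTE ACCENT
PRECOMPOSED = "\u00e1"    # U+00E1 LATIN SMALL LETTER A WITH ACUTE

# A composite sequence: one non-combining character followed by one
# combining character.
COMPOSITE = BASE + ACUTE


def is_mark(ch):
    """Approximate 'combining character' as General Category M* (assumption)."""
    return unicodedata.category(ch).startswith("M")


def code_points(s):
    """List the code points of s in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]


if __name__ == "__main__":
    print(code_points(COMPOSITE), code_points(PRECOMPOSED))
    print(unicodedata.decomposition(PRECOMPOSED))
```

The two strings normally render identically, yet they are different code point sequences and compare as unequal unless the text is normalized first.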
+ + In some CCSs, some characters consist of combinations of other + characters. For example, the letter "a with acute" might be a + combination of the two characters "a" and "combining acute", or it + might be a combination of the three characters "a", a non- + destructive backspace, and an acute. In the same or other CCSs, + it might be available as a single code point. The rules for + combining two or more characters are called "composition rules", + and the rules for taking apart a character into other characters + are called "decomposition rules". The result of decomposition is + called a "decomposed character"; the result of composition is + usually a "precomposed character". + + + + + +Hoffman & Klensin Best Current Practice [Page 16] + +RFC 6365 Internationalization Terminology September 2011 + + + normalization + + Normalization is the transformation of data to a normal form, for + example, to unify spelling. <UNICODE> + + Note that the phrase "unify spelling" in the definition above does + not mean unifying different strings with the same meaning as words + (such as "color" and "colour"). Instead, it means unifying + different character sequences that are intended to form the same + composite characters, such as "<n><combining tilde>" and "<n with + tilde>" (where "<n>" is U+006E, "<combining tilde>" is U+0303, and + "<n with tilde>" is U+00F1). + + The purpose of normalization is to allow two strings to be + compared for equivalence. The strings "<a><n><combining + tilde><o>" and "<a><n with tilde><o>" would be shown identically + on a text display device. If a protocol designer wants those two + strings to be considered equivalent during comparison, the + protocol must define where normalization occurs. + + The terms "normalization" and "canonicalization" are often used + interchangeably. Generally, they both mean to convert a string of + one or more characters into another string based on standardized + rules. 
However, in Unicode, "canonicalization" or similar terms + are used to refer to a particular type of normalization + equivalence ("canonical equivalence" in contrast to "compatibility + equivalence"), so the term should be used with some care. Some + CCSs allow multiple equivalent representations for a written + string; normalization selects one among multiple equivalent + representations as a base for reference purposes in comparing + strings. In strings of text, these rules are usually based on + decomposing combined characters or composing characters with + combining characters. Unicode Standard Annex #15 [UTR15] + describes the process and many forms of normalization in detail. + Normalization is important when comparing strings to see if they + are the same. + + The Unicode NFC and NFD normalizations support canonical + equivalence; NFKC and NFKD support canonical and compatibility + equivalence. + + + + + + + + + + + +Hoffman & Klensin Best Current Practice [Page 17] + +RFC 6365 Internationalization Terminology September 2011 + + + case + + Case is the feature of certain alphabets where the letters have + two (or occasionally more) distinct forms. These forms, which may + differ markedly in shape and size, are called the uppercase letter + (also known as capital or majuscule) and the lowercase letter + (also known as small or minuscule). Case mapping is the + association of the uppercase and lowercase forms of a letter. + <UNICODE> + + There is usually (but not always) a one-to-one mapping between the + same letter in the two cases. However, there are many examples of + characters that exist in one case but for which there is no + corresponding character in the other case or for which there is a + special mapping rule, such as the Turkish dotless "i", some Greek + characters with modifiers, and characters like the German Sharp S + (Eszett) and Greek Final Sigma that traditionally do not have + uppercase forms. 
Case mapping can even be dependent on locale or + language. Converting text to have only a single case, primarily + for comparison purposes, is called "case folding". Because of the + various unusual cases, case mapping can be quite controversial and + some case folding algorithms even more so. For example, some + programming languages such as Java have case-folding algorithms + that are locale-sensitive; this makes those algorithms incredibly + resource-intensive and makes them act differently depending on the + location of the system at the time the algorithm is used. + + sorting and collation + + Collating is the process of ordering units of textual information. + Collation is usually specific to a particular language or even to + a particular application or locale. It is sometimes known as + alphabetizing, although alphabetization is just a special case of + sorting and collation. <UNICODE> + + Collation is concerned with the determination of the relative + order of any particular pair of strings, and algorithms concerned + with collation focus on the problem of providing appropriate + weighted keys for string values, to enable binary comparison of + the key values to determine the relative ordering of the strings. + + The relative orders of letters in collation sequences can differ + widely based on the needs of the system or protocol defining the + collation order. For example, even within ASCII characters, there + are two common and very different collation orders: "A, a, B, + b,..." and "A, B, C, ..., Z, a, b,...", with additional variations + for lowercase first and digits before and after letters. + + + + +Hoffman & Klensin Best Current Practice [Page 18] + +RFC 6365 Internationalization Terminology September 2011 + + + In practice, it is rarely necessary to define a collation sequence + for characters drawn from different scripts, but arranging such + sequences so as to not surprise users is usually particularly + problematic. 
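The two ASCII collation orders mentioned above can be sketched in Python by giving the same sorting routine different keys. The word list and key function names are hypothetical, and the keys are deliberately simplistic ASCII-only stand-ins. They are not the weighted keys of a real collation algorithm.

```python
WORDS = ["b", "A", "a", "B"]


def code_point_key(s):
    """Key that orders strings by code point ("A, B, ..., Z, a, b, ...")."""
    return [ord(c) for c in s]


def interleaved_key(s):
    """Key that orders letters first, uppercase before lowercase ("A, a, B, b, ...")."""
    return (s.lower(), s)


if __name__ == "__main__":
    print(sorted(WORDS, key=code_point_key))
    print(sorted(WORDS, key=interleaved_key))
```

The sorting algorithm is the same in both calls and only the key changes. This mirrors the separation between sorting and collation that the text draws.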
+ + Sorting is the process of actually putting data records into + specified orders, according to criteria for comparison between the + records. Sorting can apply to any kind of data (including textual + data) for which an ordering criterion can be defined. Algorithms + concerned with sorting focus on the problem of performance (in + terms of time, memory, or other resources) in actually putting the + data records into the desired order. + + A sorting algorithm for string data can be internationalized by + providing it with the appropriate collation-weighted keys + corresponding to the strings to be ordered. + + Many processes have a need to order strings in a consistent + (sorted) sequence. For only a few CCS/CES combinations, there is + an obvious sort order that can be applied without reference to the + linguistic meaning of the characters: the code point order is + sufficient for sorting. That is, the code point order is also the + order that a person would use in sorting the characters. For many + CCS/CES combinations, the code point order would make no sense to + a person and therefore is not useful for sorting if the results + will be displayed to a person. + + Code point order is usually not how any human educated by a local + school system expects to see strings ordered; if one orders to the + expectations of a human, one has a "language-specific" or "human + language" sort. Sorting to code point order will seem + inconsistent if the strings are not normalized before sorting + because different representations of the same character will sort + differently. This problem may be smaller with a language-specific + sort. + + code table + + A code table is a table showing the characters allocated to the + octets in a code. <ISOIEC10646> + + Code tables are also commonly called "code charts". + + + + + + + + + +Hoffman & Klensin Best Current Practice [Page 19] + +RFC 6365 Internationalization Terminology September 2011 + + +4.1. 
Types of Characters + + The following definitions of types of characters do not clearly + delineate each character into one type, nor do they allow someone to + accurately predict what types would apply to a particular character. + The definitions are intended for application designers to help them + think about the many (sometimes confusing) properties of text. + + alphabetic + + An informative Unicode property. Characters that are the primary + units of alphabets and/or syllabaries, whether combining or non- + combining. This includes composite characters that are canonical + equivalents to a combining character sequence of an alphabetic + base character plus one or more combining characters: letter + digraphs; contextual variants of alphabetic characters; ligatures + of alphabetic characters; contextual variants of ligatures; + modifier letters; letterlike symbols that are compatibility + equivalents of single alphabetic letters; and miscellaneous letter + elements. <UNICODE> + + ideographic + + Any symbol that primarily denotes an idea (or meaning) in contrast + to a sound (or pronunciation), for example, a symbol showing a + telephone or the Han characters used in Chinese, Japanese, and + Korean. <UNICODE> + + While Unicode and many other systems use this term to refer to all + Han characters, strictly speaking not all of those characters are + actually ideographic. Some are pictographic (such as the + telephone example above), some are used phonetically, and so on. + However, the convention is to describe the script as ideographic + as contrasted to alphabetic. + + digit or number + + All modern writing systems use decimal digits in some form; some + older ones use non-positional or other systems. Different scripts + may have their own digits. 
Unicode distinguishes between numbers + and other kinds of characters by assigning a special General + Category value to them and subdividing that value to distinguish + between decimal digits, letter digits, and other digits. <UNICODE> + + + + + + + + +Hoffman & Klensin Best Current Practice [Page 20] + +RFC 6365 Internationalization Terminology September 2011 + + + punctuation + + Characters that separate units of text, such as sentences and + phrases, thus clarifying the meaning of the text. The use of + punctuation marks is not limited to prose; they are also used in + mathematical and scientific formulae, for example. <UNICODE> + + symbol + + One of a set of characters other than those used for letters, + digits, or punctuation, and representing various concepts + generally not connected to written language use per se. <RFC6365> + + Examples of symbols include characters for mathematical operators, + symbols for optical character recognition (OCR), symbols for box- + drawing or graphics, as well as symbols for dingbats, arrows, + faces, and geometric shapes. Unicode has a property that + identifies symbol characters. + + nonspacing character + + A combining character whose positioning in presentation is + dependent on its base character. It generally does not consume + space along the visual baseline in and of itself. <UNICODE> + + A combining acute accent (U+0301) is an example of a nonspacing + character. + + diacritic + + A mark applied or attached to a symbol to create a new symbol that + represents a modified or new value. They can also be marks + applied to a symbol irrespective of whether they change the value + of that symbol. In the latter case, the diacritic usually + represents an independent value (for example, an accent, tone, or + some other linguistic information). Also called diacritical mark + or diacritical. <UNICODE> + + control character + + The 65 characters in the ranges U+0000..U+001F and U+007F..U+009F. 
+ The basic space character, U+0020, is often considered a + control character as well, making the total number 66. They are + also known as control codes. In terminology adopted by Unicode + from ASCII and the ISO 8859 standards, these codes are treated as + belonging to three ranges: "C0" (for U+0000..U+001F), "C1" (for + U+0080..U+009F), and the single control character "DEL" (U+007F). + <UNICODE> + + Occasionally, in other vocabularies, the term "control character" + is used to describe any character that does not normally have an + associated glyph; it is also sometimes used for device control + sequences [ISO6429]. Neither of those usages is appropriate to + internationalization terminology in the IETF. + + formatting character + + Characters that are inherently invisible but that have an effect + on the surrounding characters. <UNICODE> + + Examples of formatting characters include characters for + specifying the direction of text and characters that specify how + to join multiple characters. + + compatibility character or compatibility variant + + A graphic character included as a coded character of ISO/IEC 10646 + primarily for compatibility with existing coded character sets. + <ISOIEC10646> + + The Unicode definition of compatibility character also includes + characters that have been incorporated for other reasons. Their + list includes several separate groups of characters included for + compatibility purposes: halfwidth and fullwidth characters used + with East Asian scripts, Arabic contextual forms (e.g., initial or + final forms), some ligatures, deprecated formatting characters, + variant forms of characters (or even copies of them) for + particular uses (e.g., phonetic or mathematical applications), + font variations, CJK compatibility ideographs, and so on. 
For + additional information and the separate term "compatibility + decomposable character", see the Unicode standard. + + For example, U+FF01 (FULLWIDTH EXCLAMATION MARK) was included for + compatibility with Asian charsets that include full-width and + half-width ASCII characters. + + Some efforts in the IETF have concluded that it would be useful to + support mapping of some groups of compatibility equivalents and + not others (e.g., supporting or mapping width variations while + preserving or rejecting mathematical variations). See the IDNA + Mapping document [RFC5895] for one example. + + + + + + + + + +Hoffman & Klensin Best Current Practice [Page 22] + +RFC 6365 Internationalization Terminology September 2011 + + +4.2. Differentiation of Subsets + + Especially as existing IETF standards are internationalized, it is + necessary to describe collections of characters including especially + various subsets of Unicode. Because Unicode includes ways to code + substantially all characters in contemporary use, subsets of the + Unicode repertoire can be a useful tool for defining these + collections as repertoires independent of specific Unicode coding. + + However specific collections are defined, it is important to remember + that, while older CCSs such as ASCII and the ISO 8859 family are + close-ended and fixed, Unicode is open-ended, with new character + definitions, and often new scripts, being added every year or so. + So, while, e.g., an ASCII subset, such as "uppercase letters", can be + specified as a range of code points (4/1 to 5/10 for that example), + similar definitions for Unicode either have to be specified in terms + of Unicode properties or are very dependent on Unicode versions (and + the relevant version must be identified in any specification). See + the IDNA code point specification [RFC5892] for an example of + specification by combinations of properties. + + Some terms are commonly used in the IETF to define character ranges + and subsets. 
Some of these are imprecise and can cause confusion if + not used carefully. + + non-ASCII + + The term "non-ASCII" strictly refers to characters other than + those that appear in the ASCII repertoire, independent of the CCS + or encoding used for them. In practice, if a repertoire such as + that of Unicode is established as context, "non-ASCII" refers to + characters in that repertoire that do not appear in the ASCII + repertoire. "Outside the ASCII repertoire" and "outside the ASCII + range" are practical, and more precise, synonyms for "non-ASCII". + + letters + + The term "letters" does not have an exact equivalent in the + Unicode standard. Letters are generally characters that are used + to write words, but that means very different things in different + languages and cultures. + + + + + + + + + + +Hoffman & Klensin Best Current Practice [Page 23] + +RFC 6365 Internationalization Terminology September 2011 + + +5. User Interface for Text + + Although the IETF does not standardize user interfaces, many + protocols make assumptions about how a user will enter or see text + that is used in the protocol. Internationalization challenges + assumptions about the type and limitations of the input and output + devices that may be used with applications that use various + protocols. It is therefore useful to consider how users typically + interact with text that might contain one or more non-ASCII + characters. + + input methods + + An input method is a mechanism for a person to enter text into an + application. <RFC6365> + + Text can be entered into a computer in many ways. Keyboards are + by far the most common device used, but many characters cannot be + entered on typical computer keyboards in a single stroke. Many + operating systems come with system software that lets users input + characters outside the range of what is allowed by keyboards. + + For example, there are dozens of different input methods for Han + characters in Chinese, Japanese, and Korean. 
Some start with + phonetic input through the keyboard, while others use the number + of strokes in the character. Input methods are also needed for + scripts that have many diacritics, such as European or Vietnamese + characters that have two or three diacritics on a single + alphabetic character. + + The term "input method editor" (IME) is often used generically to + describe the tools and software used to deal with input of + characters on a particular system. + + rendering rules + + A rendering rule is an algorithm that a system uses to decide how + to display a string of text. <RFC6365> + + Some scripts can be directly displayed with fonts, where each + character from an input stream can simply be copied from a glyph + system and put on the screen or printed page. Other scripts need + rules that are based on the context of the characters in order to + render text for display. + + Some examples of these rendering rules include: + + * Scripts such as Arabic (and many others), where the form of the + letter changes depending on the adjacent letters, whether the + letter is standing alone, at the beginning of a word, in the + middle of a word, or at the end of a word. The rendering rules + must choose between two or more glyphs. + + * Scripts such as the Indic scripts, where consonants may change + their form if they are adjacent to certain other consonants or + may be displayed in an order different from the way they are + stored and pronounced. The rendering rules must choose between + two or more glyphs. + + * Arabic and Hebrew scripts, where the order of the displayed + characters is changed by the bidirectional properties of the + alphabetic and other characters and by right-to-left and + left-to-right ordering marks. The rendering rules must choose + the order in which characters are displayed. 
+ + * Some writing systems cannot have their rendering rules suitably + defined using mechanisms that are now defined in the Unicode + Standard. None of those languages are in active non-scholarly + use today. + + * Many systems use a special rendering rule when they lack a font + or other mechanism for rendering a particular character + correctly. That rule typically involves substitution of a + small open box or a question mark for the missing character. + See "undisplayable character" below. + + graphic symbol + + A graphic symbol is the visual representation of a graphic + character or of a composite sequence. <ISOIEC10646> + + font + + A font is a collection of glyphs used for the visual depiction of + character data. A font is often associated with a set of + parameters (for example, size, posture, weight, and serifness), + which, when set to particular values, generates a collection of + imagable glyphs. <UNICODE> + + The term "font" is often used interchangeably with "typeface". As + historically used in typography, a typeface is a family of one or + more fonts that share a common general design. For example, + "Times Roman" is actually a typeface, with a collection of fonts + + + +Hoffman & Klensin Best Current Practice [Page 25] + +RFC 6365 Internationalization Terminology September 2011 + + + such as "Times Roman Bold", "Times Roman Medium", "Times Roman + Italic", and so on. Some sources even consider different type + sizes within a typeface to be different fonts. While those + distinctions are rarely important for internationalization + purposes, there are exceptions. Those writing specifications + should be very careful about definitions in cases in which the + exceptions might lead to ambiguity. + + bidirectional display + + The process or result of mixing left-to-right oriented text and + right-to-left oriented text in a single line is called + bidirectional display, often abbreviated as "bidi". 
<UNICODE> + + Most of the world's written languages are displayed left-to-right. + However, many widely-used written languages such as ones based on + the Hebrew or Arabic scripts are displayed primarily right-to-left + (numerals are a common exception in the modern scripts). Right- + to-left text often confuses protocol writers because they have to + keep thinking in terms of the order of characters in a string in + memory, an order that might be different from what they see on the + screen. (Note that some languages are written both horizontally + and vertically and that some historical ones use other display + orderings.) + + Further, bidirectional text can cause confusion because there are + formatting characters in ISO/IEC 10646 that cause the order of + display of text to change. These explicit formatting characters + change the display regardless of the implicit left-to-right or + right-to-left properties of characters. Text that might contain + those characters typically requires careful processing before + being sorted or compared for equality. + + It is common to see strings with text in both directions, such as + strings that include both text and numbers, or strings that + contain a mixture of scripts. + + Unicode has a long and incredibly detailed algorithm for + displaying bidirectional text [UAX9]. + + undisplayable character + + A character that has no displayable form. <RFC6365> + + For instance, the zero-width space (U+200B) cannot be displayed + because it takes up no horizontal space. Formatting characters + such as those for setting the direction of text are also + undisplayable. Note, however, that every character in [UNICODE] + + + +Hoffman & Klensin Best Current Practice [Page 26] + +RFC 6365 Internationalization Terminology September 2011 + + + has a glyph associated with it, and that the glyphs for + undisplayable characters are enclosed in a dashed square as an + indication that the actual character is undisplayable. 
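The properties behind the bidirectional display and undisplayable character entries above can be inspected with Python's unicodedata module. In this sketch, the helper name describe and the sample characters are assumptions. Treating General Category "Cf" as a marker for formatting characters is a convenient approximation, not a definition taken from RFC 6365.

```python
import unicodedata


def describe(ch):
    """Return (code point, General Category, bidirectional class) for ch."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
    )


if __name__ == "__main__":
    # Zero-width space, right-to-left mark, Hebrew alef, Latin capital A.
    for ch in ("\u200b", "\u200f", "\u05d0", "A"):
        print(describe(ch))
```

The zero-width space and the right-to-left mark both report category "Cf". The Hebrew letter and the right-to-left mark share the strong right-to-left class "R", which is the implicit property that a bidirectional algorithm consults when ordering text for display.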
+ + The property of a character that causes it to be undisplayable is + intrinsic to its definition. Undisplayable characters can never + be displayed in normal text (the dashed square notation is used + only in special circumstances). Printable characters whose + Unicode definitions are associated with glyphs that cannot be + rendered on a particular system are not, in this sense, + undisplayable. + + writing style + + Conventions of writing the same script in different styles. + <RFC6365> + + Different communities using the script may find text in different + writing styles difficult to read and possibly unintelligible. For + example, the Perso-Arabic Nastalique writing style and the Arabic + Naskh writing style both use the Arabic script but have very + different renderings and are not mutually comprehensible. Writing + styles may have significant impact on internationalization; for + example, the Nastalique writing style requires significantly more + line height than Naskh writing style. + +6. Text in Current IETF Protocols + + Many IETF protocols started off being fully internationalized, while + others have been internationalized as they were revised. In this + process, IETF members have seen patterns in the way that many + protocols use text. This section describes some specific protocol + interactions with text. + + protocol elements + + Protocol elements are uniquely named parts of a protocol. + <RFC6365> + + Almost every protocol has named elements, such as "source port" in + TCP. In some protocols, the names of the elements (or text tokens + for the names) are transmitted within the protocol. For example, + in SMTP and numerous other IETF protocols, the names of the verbs + are part of the command stream. The names are thus part of the + protocol standard. 
The names of protocol elements are not + normally seen by end users, and it is rarely appropriate to + internationalize protocol element names (even while the elements + themselves can be internationalized). + + name spaces + + A name space is the set of valid names for a particular item, or + the syntactic rules for generating these valid names. <RFC6365> + + Many items in Internet protocols use names to identify specific + instances or values. The names may be generated (by some + prescribed rules), registered centrally (e.g., with IANA), + or have a distributed registration and control mechanism, such as + the names in the DNS. + + on-the-wire encoding + + The encoding and decoding used before and after transmission over + the network is often called the "on-the-wire" (or sometimes just + "wire") format. <RFC6365> + + Characters are identified by code points. Before being + transmitted in a protocol, they must first be encoded as bits and + octets. Similarly, when characters are received in a + transmission, they have been encoded, and a protocol that needs to + process the individual characters needs to decode them before + processing. + + parsed text + + Text strings that have been analyzed for subparts. <RFC6365> + + In some protocols, free text in text fields might be parsed. For + example, many mail user agents (MUAs) will parse the words in the + text of the Subject: field to attempt to thread based on what + appears after the "Re:" prefix. + + Such conventions are very sensitive to localization. If, for + example, a form like "Re:" is altered by an MUA to reflect the + language of the sender or recipient, a system that subsequently + does threading may not recognize the replacement term as a + delimiter string. + + charset identification + + Specification of the charset used for a string of text. 
<RFC6365> + + Protocols that allow more than one charset to be used in the same + place should require that the text be identified with the + appropriate charset. Without this identification, a program + looking at the text cannot definitively discern the charset of the + text. Charset identification is also called "charset tagging". + + + +Hoffman & Klensin Best Current Practice [Page 28] + +RFC 6365 Internationalization Terminology September 2011 + + + language identification + + Specification of the human language used for a string of text. + <RFC6365> + + Some protocols (such as MIME and HTTP) allow text that is meant + for machine processing to be identified with the language used in + the text. Such identification is important for machine processing + of the text, such as by systems that render the text by speaking + it. Language identification is also called "language tagging". + The IETF "LTRU" standards [RFC5646] and [RFC4647] provide a + comprehensive model for language identification. + + MIME + + MIME (Multipurpose Internet Mail Extensions) is a message format + that allows for textual message bodies and headers in character + sets other than US-ASCII in formats that require ASCII (most + notably RFC 5322, the standard for Internet mail headers + [RFC5322]). MIME is described in RFCs 2045 through 2049, as well + as more recent RFCs. <RFC6365> + + transfer encoding syntax + + A transfer encoding syntax (TES) (sometimes called a transfer + encoding scheme) is a reversible transform of already encoded data + that is represented in one or more character encoding schemes. + <RFC6365> + + TESs are useful for encoding types of character data into another + format, usually for allowing new types of data to be transmitted + over legacy protocols. The main examples of TESs used in the IETF + include Base64 and quoted-printable. MIME identifies the transfer + encoding syntax for body parts as a Content-transfer-encoding, + occasionally abbreviated C-T-E. 
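The layering described above, a character encoding scheme first and a reversible transfer encoding syntax applied to the resulting octets, can be sketched with Python's standard base64 and quopri modules. The sample string and the function names are illustrative assumptions.

```python
import base64
import quopri

TEXT = "caf\u00e9"              # contains one non-ASCII character, U+00E9
OCTETS = TEXT.encode("utf-8")   # character encoding scheme applied first


def to_base64(data):
    """Apply the Base64 transfer encoding syntax to already encoded octets."""
    return base64.b64encode(data)


def to_quoted_printable(data):
    """Apply the quoted-printable transfer encoding syntax to encoded octets."""
    return quopri.encodestring(data)


if __name__ == "__main__":
    print(to_base64(OCTETS))
    print(to_quoted_printable(OCTETS))
    # Both transforms are reversible: decoding restores the original octets.
    print(base64.b64decode(to_base64(OCTETS)).decode("utf-8"))
```

Both outputs consist solely of ASCII octets, which is what allows the data to pass through legacy ASCII-only protocols. In the quoted-printable form, the ASCII letters of the input remain readable.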
+ + Base64 + + Base64 is a transfer encoding syntax that allows binary data to be + represented by the ASCII characters A through Z, a through z, 0 + through 9, +, /, and =. It is defined in [RFC2045]. <RFC6365> + + quoted printable + + Quoted printable is a transfer encoding syntax that allows strings + that have non-ASCII characters mixed in with mostly ASCII + printable characters to be somewhat human readable. It is + described in [RFC2047]. <RFC6365> + + + +Hoffman & Klensin Best Current Practice [Page 29] + +RFC 6365 Internationalization Terminology September 2011 + + + The quoted printable syntax is generally considered to be a + failure at being readable. It is jokingly referred to as "quoted + unreadable". + + XML + + XML (which is an approximate abbreviation for Extensible Markup + Language) is a popular method for structuring text. XML text that + is not encoded as UTF-8 is explicitly tagged with charsets, and + all text in XML consists only of Unicode characters. The + specification for XML can be found at <http://www.w3.org/XML/>. + <RFC6365> + + ASN.1 text formats + + The ASN.1 data description language has many formats for text + data. The formats allow for different repertoires and different + encodings. Some of the formats that appear in IETF standards + based on ASN.1 include IA5String (all ASCII characters), + PrintableString (most ASCII characters, but missing many + punctuation characters), BMPString (characters from ISO/IEC 10646 + plane 0 in UTF-16BE format), UTF8String (just as the name + implies), and TeletexString (also called T61String). + + ASCII-compatible encoding (ACE) + + Starting in 1996, many ASCII-compatible encoding schemes (which + are actually transfer encoding syntaxes) have been proposed as + possible solutions for internationalizing host names and some + other purposes. Their goal is to be able to encode any string of + ISO/IEC 10646 characters using the preferred syntax for domain + names (as described in STD 13). 
At the time of this writing, only + the ACE produced by Punycode [RFC3492] has become an IETF + standard. + + The choice of ACE forms to internationalize legacy protocols must + be made with care as it can cause some difficult side effects + [RFC6055]. + + LDH label + + The classical label form used in the DNS and most applications + that call on it, albeit with some additional restrictions, + reflects the early syntax of "hostnames" [RFC0952] and limits + those names to ASCII letters, digits, and embedded hyphens. The + hostname syntax is identical to that described as the "preferred + name syntax" in Section 3.5 of RFC 1034 [RFC1034] as modified by + + + + +Hoffman & Klensin Best Current Practice [Page 30] + +RFC 6365 Internationalization Terminology September 2011 + + + RFC 1123 [RFC1123]. LDH labels are defined in a more restrictive + and precise way for internationalization contexts as part of the + IDNA2008 specification [RFC5890]. + +7. Terms Associated with Internationalized Domain Names + +7.1. IDNA Terminology + + The current specification for Internationalized Domain Names (IDNs), + known formally as Internationalized Domain Names for Applications or + IDNA, is referred to in the IETF and parts of the broader community + as "IDNA2008" and consists of several documents. Section 2.3 of the + first of those documents, commonly known as "IDNA2008 Definitions" + [RFC5890] provides definitions and introduces some specialized terms + for differentiating among types of DNS labels in an IDN context. + Those terms are listed in the table below; see RFC 5890 for the + specific definitions if needed. 
+ + ACE Prefix + A-label + Domain Name Slot + IDNA-valid string + Internationalized Domain Name (IDN) + Internationalized Label + LDH Label + Non-Reserved LDH label (NR-LDH label) + U-label + + Two additional terms entered the IETF's vocabulary as part of the + earlier IDN effort [RFC3490] (IDNA2003): + + Stringprep + + Stringprep [RFC3454] provides a model and character tables for + preparing and handling internationalized strings. It was used + in the original IDN specification (IDNA2003) via a profile + called "Nameprep" [RFC3491]. It is no longer in use in IDNA, + but continues to be used in profiles by a number of other + protocols. <RFC6365> + + Punycode + + This is the name of the algorithm [RFC3492] used to convert + otherwise-valid IDN labels from native-character strings + expressed in Unicode to an ASCII-compatible encoding (ACE). + Strictly speaking, the term applies to the algorithm only. In + practice, it is widely, if erroneously, used to refer to + strings that the algorithm encodes. + + + +Hoffman & Klensin Best Current Practice [Page 31] + +RFC 6365 Internationalization Terminology September 2011 + + +7.2. Character Relationships and Variants + + The term "variant" was introduced into the IETF i18n vocabulary with + the JET recommendations [RFC3743]. As used there, it referred + strictly to the relationship between Traditional Chinese characters + and their Simplified equivalents. The JET recommendations provided a + model for identifying these pairs of characters and labels that used + them. Specific recommendations for variant handling for the Chinese + language were provided in a follow-up document [RFC4713]. + + In more recent years, the term has also been used to describe other + collections of characters or strings that might be perceived as + equivalent. Those collections have involved one or more of several + categories of characters and labels containing them including: + + o "visually similar" or "visually confusable" characters. 
These may + be limited to characters in different scripts, characters in a + single script, or both, and may be those that can appear to be + alike even when high-distinguishability reference fonts are used + or under various circumstances that may involve malicious choices + of typefaces or other ways to trick user perception. Trivial + examples include ASCII "l" and "1" and Latin and Cyrillic "a". + + o Characters assigned more than one Unicode code point because of + some special property. These characters may be considered "the + same" for some purposes and different for others (or by other + users). One of the most commonly cited examples is the Arabic + YEH, which is encoded more than once because some of its shapes + are different across different languages. Another example is the + Greek lowercase sigma and final sigma: if the latter were viewed + purely as a positional presentation variation on the former, it + should not have been assigned a separate code point. + + o Numerals and labels including them. Unlike letters, the "meaning" + of decimal digits is clear and unambiguous regardless of the + script with which they are associated. Some scripts are routinely + used almost interchangeably with European digits and digits native + to that script. The Arabic script has two sets of digits + (U+0660..U+0669 and U+06F0..U+06F9), written identically for zero + through three and seven through nine but differently for four + through six; European digits predominate in other areas. + Substitution of digits with the same numeric value in labels may + give rise to another type of variant. + + o Orthographic differences within a language. Many languages have + alternate choices of spellings or spellings that differ by locale. + Users of those languages generally recognize the spellings as + equivalent, at least as much so as the variations described above.
+ + + +Hoffman & Klensin Best Current Practice [Page 32] + +RFC 6365 Internationalization Terminology September 2011 + + + Examples include "color" and "colour" in English, German words + spelled with o-umlaut or "oe", and so on. Some of these + relationships may also create other types of language-specific + perceived differences that do not exist for other languages using + the same script. For example, in Arabic language usage at the end + of words, ARABIC LETTER TEH MARBUTA (U+0629) and ARABIC LETTER HEH + (U+0647) are differently shaped (one has 2 dots on top of it), but + they are used interchangeably in writing: they "sound" similar + when pronounced at the end of a phrase, and hence the LETTER TEH + MARBUTA sometimes is written as LETTER HEH and the two are + considered "confusable" in that context. + + The term "variant" as used in this section should also not be + confused with other uses of the term in this document or in Unicode + terminology (e.g., those in Section 4.1 above). If the term is to be + used at all, context should clearly distinguish among these different + uses and, in particular, between variant characters and variant + labels. Local text should identify which meaning, or combination of + meanings, is intended. + +8. Other Common Terms in Internationalization + + This is a hodge-podge of other terms that have appeared in + internationalization discussions in the IETF. + + locale + + Locale is the user-specific location and cultural information + managed by a computer. <RFC6365> + + Because languages and orthographic conventions differ from country + to country (and even region to region within a country), the + locale of the user can often be an important factor. Typically, + the locale information for a user includes the language(s) used. + + Locale issues go beyond character use, and can include things such + as the display format for currency, dates, and times.
Some + locales (especially the popular "C" and "POSIX" locales) do not + include language information. + + It should be noted that there are many thorny, unsolved issues + with locale. For example, should text be viewed using the locale + information of the person who wrote the text, information that + would apply to the location of the system storing or providing the + text, or the person viewing it? What if the person viewing it is + traveling to different locations? Should only some of the locale + information affect creation and editing of text? + + + + +Hoffman & Klensin Best Current Practice [Page 33] + +RFC 6365 Internationalization Terminology September 2011 + + + Latin characters + + "Latin characters" is an imprecise term for characters + historically related to ancient Greek script as modified in the + Roman Republic and Empire and currently used throughout the world. + <RFC6365> + + The base Latin characters are a subset of the ASCII repertoire and + have been augmented by many single and multiple diacritics and + quite a few other characters. ISO/IEC 10646 encodes the Latin + characters in ranges including U+0020..U+024F and U+1E00..U+1EFF. + + Because "Latin characters" is used in different contexts to refer + to the letters from the ASCII repertoire, the subset of those + characters used late in the Roman Republic period, or the + different subset used to write Latin in medieval times, the entire + ASCII repertoire, all of the code points in the extended Latin + script as defined by Unicode, and other collections, the term + should be avoided in IETF specifications when possible. + Similarly, "Basic Latin" should not be used as a synonym for + "ASCII". + + romanization + + The transliteration of a non-Latin script into Latin characters.
+ <RFC6365> + + Because of their widespread use, Latin characters (or graphemes + constructed from them) are often used to try to write text in + languages that didn't previously have writing systems or whose + writing systems were originally based on different scripts. For + example, there are two popular romanizations of Chinese: Wade- + Giles and Pinyin, the latter of which is by far more common today. + Many romanization systems are inexact and do not give perfect + round-trip mappings between the native script and the Latin + characters. + + CJK characters and Han characters + + The ideographic characters used in Chinese, Japanese, Korean, and + traditional Vietnamese writing systems are often called "CJK + characters" after the initial letters of the language names in + English. They are also called "Han characters", after the term in + Chinese that is often used for these characters. <RFC6365> + + + + + + + +Hoffman & Klensin Best Current Practice [Page 34] + +RFC 6365 Internationalization Terminology September 2011 + + + Note that Han characters do not include the phonetic characters + used in the Japanese and Korean languages. Users of the term "CJK + characters" may or may not assume those additional characters are + included. + + In ISO/IEC 10646, the Han characters were "unified", meaning that + each set of Han characters from Japanese, Chinese, and/or Korean + that had the same origin was assigned a single code point. The + positive result of this was that many fewer code points were + needed to represent Han; the negative result of this was that + characters that people who write the three languages think are + different have the same code point. There is a great deal of + disagreement on the nature, the origin, and the severity of the + problems caused by Han unification. + + translation + + The process of conveying the meaning of some passage of text in + one language, so that it can be expressed equivalently in another + language. 
<RFC6365> + + Many language translation systems are inexact and cannot be + applied repeatedly to go from one language to another to another. + + transliteration + + The process of representing the characters of an alphabetical or + syllabic system of writing by the characters of a conversion + alphabet. <RFC6365> + + Many script transliterations are exact, and many have perfect + round-trip mappings. The notable exception to this is + romanization, described above. Transliteration involves + converting text expressed in one script into another script, + generally on a letter-by-letter basis. There are many official + and unofficial transliteration standards, most notably those from + ISO TC 46 and the U.S. Library of Congress. + + transcription + + The process of systematically writing the sounds of some passage + of spoken language, generally with the use of a technical phonetic + alphabet (usually Latin-based) or other systematic transcriptional + orthography. Transcription also sometimes refers to the + conversion of written text into a transcribed form, based on the + sound of the text as if it had been spoken. <RFC6365> + + + + + +Hoffman & Klensin Best Current Practice [Page 35] + +RFC 6365 Internationalization Terminology September 2011 + + + Unlike transliterations, which are generally designed to be round- + trip convertible, transcriptions of written material are almost + never round-trip convertible to their original form, at least + without some supplemental information. + + regular expressions + + Regular expressions provide a mechanism to select specific strings + from a set of character strings. Regular expressions are a + language used to search for text within strings, and possibly + modify the text found with other text. <RFC6365> + + Pattern matching for text involves being able to represent one or + more code points in an abstract notation, such as searching for + all capital Latin letters or all punctuation. 
The most common + mechanism in IETF protocols for naming such patterns is the use of + regular expressions. There is no single regular expression + language, but there are numerous very similar dialects that are + not quite consistent with each other. + + The Unicode Consortium has a good discussion about how to adapt + regular expression engines to use Unicode. [UTR18] + + private use character + + ISO/IEC 10646 code points from U+E000 to U+F8FF, U+F0000 to + U+FFFFD, and U+100000 to U+10FFFD are available for private use. + This refers to code points of the standard whose interpretation is + not specified by the standard and whose use may be determined by + private agreement among cooperating users. <UNICODE> + + The use of these "private use" characters is defined by the + parties who transmit and receive them, and is thus not appropriate + for standardization. (The IETF has a long history of private use + names for things such as "x-" names in MIME types, charsets, and + languages. Most of the experience with these has been quite + negative, with many implementors assuming that private use names + are in fact public and long-lived.) + +9. Security Considerations + + Security is not discussed directly in this document. While the + definitions here have no direct effect on security, they are used in + many security contexts. For example, authentication usually involves + comparing two tokens, and one or both of those tokens might be text; + thus, some methods of comparison might involve using some of the + internationalization concepts for which terms are defined in this + document. + + + +Hoffman & Klensin Best Current Practice [Page 36] + +RFC 6365 Internationalization Terminology September 2011 + + + Having said that, other RFCs dealing with internationalization have + security consideration descriptions that may be useful to the reader + of this document. 
In particular, the security considerations in RFC + 3454, RFC 3629, RFC 4013 [RFC4013], and RFC 5890 go into a fair + amount of detail. + +10. References + +10.1. Normative References + + [ISOIEC10646] ISO/IEC, "ISO/IEC 10646:2011. International Standard + -- Information technology - Universal Multiple-Octet + Coded Character Set (UCS)", 2011. + + [RFC2047] Moore, K., "MIME (Multipurpose Internet Mail + Extensions) Part Three: Message Header Extensions for + Non-ASCII Text", RFC 2047, November 1996. + + [UNICODE] The Unicode Consortium, "The Unicode Standard, + Version 6.0", (Mountain View, CA: The Unicode + Consortium, 2011. ISBN 978-1-936213-01-6). + <http://www.unicode.org/versions/Unicode6.0.0/>. + +10.2. Informative References + + [CHARMOD] W3C, "Character Model for the World Wide Web 1.0", + 2005, <http://www.w3.org/TR/charmod/>. + + [FRAMEWORK] ISO/IEC, "ISO/IEC TR 11017:1997(E). Information + technology - Framework for internationalization, + prepared by ISO/IEC JTC 1/SC 22/WG 20", 1997. + + [ISO3166] ISO, "ISO 3166-1:2006 - Codes for the representation + of names of countries and their subdivisions -- Part + 1: Country codes", 2006. + + [ISO639] ISO, "ISO 639-1:2002 - Code for the representation of + names of languages - Part 1: Alpha-2 code", 2002. + + [ISO6429] ISO/IEC, "ISO/IEC 6429:1992. Information + technology -- Control functions for coded character + sets", 1992. + + [RFC0952] Harrenstien, K., Stahl, M., and E. Feinler, "DoD + Internet host table specification", RFC 952, + October 1985. + + + + + +Hoffman & Klensin Best Current Practice [Page 37] + +RFC 6365 Internationalization Terminology September 2011 + + + [RFC1034] Mockapetris, P., "Domain names - concepts and + facilities", STD 13, RFC 1034, November 1987. + + [RFC1123] Braden, R., "Requirements for Internet Hosts - + Application and Support", STD 3, RFC 1123, + October 1989. + + [RFC2045] Freed, N. and N.
Borenstein, "Multipurpose Internet + Mail Extensions (MIME) Part One: Format of Internet + Message Bodies", RFC 2045, November 1996. + + [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate + Requirement Levels", BCP 14, RFC 2119, March 1997. + + [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and + Languages", BCP 18, RFC 2277, January 1998. + + [RFC2781] Hoffman, P. and F. Yergeau, "UTF-16, an encoding of + ISO 10646", RFC 2781, February 2000. + + [RFC2978] Freed, N. and J. Postel, "IANA Charset Registration + Procedures", BCP 19, RFC 2978, October 2000. + + [RFC3454] Hoffman, P. and M. Blanchet, "Preparation of + Internationalized Strings ("stringprep")", RFC 3454, + December 2002. + + [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, + "Internationalizing Domain Names in Applications + (IDNA)", RFC 3490, March 2003. + + [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep + Profile for Internationalized Domain Names (IDN)", + RFC 3491, March 2003. + + [RFC3492] Costello, A., "Punycode: A Bootstring encoding of + Unicode for Internationalized Domain Names in + Applications (IDNA)", RFC 3492, March 2003. + + [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO + 10646", STD 63, RFC 3629, November 2003. + + [RFC3743] Konishi, K., Huang, K., Qian, H., and Y. Ko, "Joint + Engineering Team (JET) Guidelines for + Internationalized Domain Names (IDN) Registration and + Administration for Chinese, Japanese, and Korean", + RFC 3743, April 2004. + + + + +Hoffman & Klensin Best Current Practice [Page 38] + +RFC 6365 Internationalization Terminology September 2011 + + + [RFC4013] Zeilenga, K., "SASLprep: Stringprep Profile for User + Names and Passwords", RFC 4013, February 2005. + + [RFC4647] Phillips, A. and M. Davis, "Matching of Language + Tags", BCP 47, RFC 4647, September 2006. + + [RFC4713] Lee, X., Mao, W., Chen, E., Hsu, N., and J. 
Klensin, + "Registration and Administration Recommendations for + Chinese Domain Names", RFC 4713, October 2006. + + [RFC5137] Klensin, J., "ASCII Escaping of Unicode Characters", + BCP 137, RFC 5137, February 2008. + + [RFC5198] Klensin, J. and M. Padlipsky, "Unicode Format for + Network Interchange", RFC 5198, March 2008. + + [RFC5322] Resnick, P., Ed., "Internet Message Format", + RFC 5322, October 2008. + + [RFC5646] Phillips, A. and M. Davis, "Tags for Identifying + Languages", BCP 47, RFC 5646, September 2009. + + [RFC5890] Klensin, J., "Internationalized Domain Names for + Applications (IDNA): Definitions and Document + Framework", RFC 5890, August 2010. + + [RFC5892] Faltstrom, P., "The Unicode Code Points and + Internationalized Domain Names for Applications + (IDNA)", RFC 5892, August 2010. + + [RFC5895] Resnick, P. and P. Hoffman, "Mapping Characters for + Internationalized Domain Names in Applications (IDNA) + 2008", RFC 5895, September 2010. + + [RFC6055] Thaler, D., Klensin, J., and S. Cheshire, "IAB + Thoughts on Encodings for Internationalized Domain + Names", RFC 6055, February 2011. + + [UAX34] The Unicode Consortium, "Unicode Standard Annex #34: + Unicode Named Character Sequences", 2010, + <http://www.unicode.org/reports/tr34>. + + [UAX9] The Unicode Consortium, "Unicode Standard Annex #9: + Unicode Bidirectional Algorithm", 2010, + <http://www.unicode.org/reports/tr9>. + + + + + + +Hoffman & Klensin Best Current Practice [Page 39] + +RFC 6365 Internationalization Terminology September 2011 + + + [US-ASCII] ANSI, "Coded Character Set -- 7-bit American Standard + Code for Information Interchange, ANSI X3.4-1986", + 1986. + + [UTN6] The Unicode Consortium, "Unicode Technical Note #6: + BOCU-1: MIME-Compatible Unicode Compression", 2006, + <http://www.unicode.org/notes/tn6/>. + + [UTR15] The Unicode Consortium, "Unicode Standard Annex #15: + Unicode Normalization Forms", 2010, + <http://www.unicode.org/reports/tr15>.
+ + [UTR18] The Unicode Consortium, "Unicode Technical Standard + #18: Unicode Regular Expressions", 2008, + <http://www.unicode.org/reports/tr18>. + + [UTR22] The Unicode Consortium, "Unicode Technical Standard + #22: Unicode Character Mapping Markup Language", + 2009, <http://www.unicode.org/reports/tr22>. + + [UTR6] The Unicode Consortium, "Unicode Technical Standard + #6: A Standard Compression Scheme for Unicode", 2005, + <http://www.unicode.org/reports/tr6>. + + [W3C-i18n-Def] W3C, "Localization vs. Internationalization", + September 2010, <http://www.w3.org/International/ + questions/qa-i18n.en>. + + + + + + + + + + + + + + + + + + + + + + + + + +Hoffman & Klensin Best Current Practice [Page 40] + +RFC 6365 Internationalization Terminology September 2011 + + +Appendix A. Additional Interesting Reading + + Barry, Randall, ed. ALA-LC Romanization Tables. Washington: U.S. + Library of Congress, 1997. ISBN 0844409405 + + Coulmas, Florian. Blackwell Encyclopedia of Writing Systems. + Oxford: Blackwell Publishers, 1999. ISBN 063121481X + + Dalby, Andrew. Dictionary of Languages: The Definitive Reference to + More than 400 Languages. New York: Columbia University Press, 2004. + ISBN 978-0231115698 + + Daniels, Peter, and William Bright. The World's Writing Systems. + New York: Oxford University Press, 1996. ISBN 0195079930 + + DeFrancis, John. The Chinese Language: Fact and Fantasy. Honolulu: + University of Hawaii Press, 1984. ISBN 0-8284-085505 and + 0-8248-1058-6 + + Drucker, Joanna. The Alphabetic Labyrinth: The Letters in History + and Imagination. London: Thames & Hudson, 1995. ISBN 0-500-28068-1 + + Fazzioli, Edoardo. Chinese Calligraphy. New York: Abbeville Press, + 1986, 1987 (English translation). ISBN 0-89659-774-1 + + Hooker, J.T., et al. Reading the Past: Ancient Writing from + Cuneiform to the Alphabet. London: British Museum Press, 1990. ISBN + 0-7141-8077-7 + + Lunde, Ken. CJKV Information Processing. Sebastopol, CA: O'Reilly & + Assoc., 1999.
ISBN 1-56592-224-7 + + Nakanishi, Akira. Writing Systems of the World. Rutland, VT: + Charles E. Tuttle Company, 1980. ISBN 0804816549 + + Robinson, Andrew. The Story of Writing: Alphabets, Hieroglyphs, & + Pictograms. London: Thames & Hudson, 1995, 2000. ISBN 0-500-28156-4 + + Sacks, David. Language Visible. New York: Broadway Books (a + division of Random House, Inc.), 2003. ISBN 0-7679-1172-5 + + + + + + + + + + + +Hoffman & Klensin Best Current Practice [Page 41] + +RFC 6365 Internationalization Terminology September 2011 + + +Appendix B. Acknowledgements + + The definitions in this document come from many sources, including a + wide variety of IETF documents. + + James Seng contributed to the initial outline of RFC 3536. Harald + Alvestrand and Martin Duerst made extensive useful comments on early + versions. Others who contributed to the development of RFC 3536 + include Dan Kohn, Jacob Palme, Johan van Wingen, Peter Constable, + Yuri Demchenko, Susan Harris, Zita Wenzel, John Klensin, Henning + Schulzrinne, Leslie Daigle, Markus Scherer, and Ken Whistler. + + Abdulaziz Al-Zoman, Tim Bray, Frank Ellermann, Antonio Marko, JFC + Morphin, Sarmad Hussain, Mykyta Yevstifeyev, Ken Whistler, and others + identified important issues with, or made specific suggestions for, + this new version. + +Appendix C. Significant Changes from RFC 3536 + + This document mostly consists of additions to RFC 3536. The + following is a list of the most significant changes. + + o Changed the document's status to BCP. + + o Commonly used synonyms added to several descriptions and indexed. + + o A list of terms defined and used in IDNA2008 was added, with a + pointer to RFC 5890. Those definitions have not been repeated in + this document. + + o The much-abused term "variant" is now discussed in some detail. + + o A discussion of different subsets of the Unicode repertoire was + added as Section 4.2 and associated definitions were included. + + o Added a new term, "writing style". 
+ + o Discussions of case-folding and mapping were expanded. + + o Minor edits were made to some section titles and a number of other + editorial improvements were made. + + o The discussion of control codes was updated to include additional + information and clarify that "control code" and "control + character" are synonyms. + + o Many terms were clarified to reflect contemporary usage. + + + + +Hoffman & Klensin Best Current Practice [Page 42] + +RFC 6365 Internationalization Terminology September 2011 + + + o The index to terms by section in RFC 3536 was replaced by an index + to pages containing considerably more terms. + + o The acknowledgments were updated. + + o Some of the references were updated. + + o The supplemental reading list was expanded somewhat. + +Index + + A + A-label 31 + ACE 30, 31 + ACE Prefix 31 + alphabetic 20 + ANSI 13 + ASCII 15 + ASCII-compatible encoding 30, 31 + ASN.1 text formats 30 + + B + Base64 29 + Basic Multilingual Plane 13 + bidi 26 + bidirectional display 26 + BMP 13 + BMPString 30 + BOCU-1 14 + BOM 14 + byte order mark 14 + + C + C-T-E 29 + case 18 + CCS 7 + CEN/ISSS 13 + character 6 + character encoding form 7 + character encoding scheme 8 + character repertoire 7 + charset 8 + charset identification 28 + CJK characters 34 + code chart 19 + code point 16 + code table 19 + coded character 6 + + + +Hoffman & Klensin Best Current Practice [Page 43] + +RFC 6365 Internationalization Terminology September 2011 + + + coded character set 7 + collation 18 + combining character 16 + combining character sequence 16 + compatibility character 22 + compatibility variant 22 + composite sequence 16 + content-transfer-encoding 29 + control character 21 + control code 21 + control sequence 22 + + D + decomposed character 16 + diacritic 21 + displaying and rendering text 10 + Domain Name Slot 31 + + E + encoding forms 13 + + F + font 25 + formatting character 22 + + G + glyph 7 + glyph code 7 + graphic symbol 25 + + H + Han characters 34 + + I 
+ i18n 9 + IA5String 30 + ideographic 20 + IDN 31 + IDNA 31 + IDNA-valid string 31 + IDNA2003 31 + IDNA2008 31 + IME 24 + input method editor 24 + input methods 24 + internationalization 8 + Internationalized Domain Name 31 + Internationalized Label 31 + + + +Hoffman & Klensin Best Current Practice [Page 44] + +RFC 6365 Internationalization Terminology September 2011 + + + ISO 11 + ISO 639 11 + ISO 3166 11 + ISO 8859 15 + ISO TC 46 11 + + J + JIS 13 + JTC 1 11 + + L + l10n 9 + language 5 + language identification 29 + Latin characters 34 + LDH Label 30 + letters 23 + Local and regional standards organizations 13 + locale 33 + localization 9 + + M + MIME 29 + multilingual 10 + + N + name spaces 28 + Nameprep 31 + NFC 17 + NFD 17 + NFKC 17 + NFKD 17 + non-ASCII 23 + nonspacing character 21 + normalization 17 + NR-LDH label 31 + NVT 15 + + O + on-the-wire encoding 28 + + + + + + + + + + + +Hoffman & Klensin Best Current Practice [Page 45] + +RFC 6365 Internationalization Terminology September 2011 + + + P + parsed text 28 + precomposed character 16 + PrintableString 30 + private use character 36 + protocol elements 27 + punctuation 21 + Punycode 30, 31 + + Q + quoted-printable 29 + + R + regular expressions 36 + rendering rules 24 + repertoire 7 + romanization 34 + + S + SAC 13 + script 5 + SCSU 14 + sorting 18 + Stringprep 31 + surrogate pair 14 + symbol 21 + + T + T61String 30 + TeletexString 30 + TES 29 + transcoding 7 + transcription 35 + transfer encoding syntax 29 + transformation formats 13 + translation 35 + transliteration 34, 35 + typeface 25 + + U + U-label 31 + UCS-2 13 + UCS-4 13 + undisplayable character 26 + Unicode Consortium 12 + US-ASCII 15 + UTC 12 + UTF-8 14 + + + +Hoffman & Klensin Best Current Practice [Page 46] + +RFC 6365 Internationalization Terminology September 2011 + + + UTF-16 14 + UTF-16BE 14 + UTF-16LE 14 + UTF-32 14 + UTF8String 30 + + V + variant 32 + + W + W3C 13 + World Wide Web Consortium 13 + writing style 27 + writing system 6 + +
X + XML 13, 30 + +Authors' Addresses + + Paul Hoffman + VPN Consortium + + EMail: paul.hoffman@vpnc.org + + + John C Klensin + 1770 Massachusetts Ave, Ste 322 + Cambridge, MA 02140 + USA + + Phone: +1 617 245 1457 + EMail: john+ietf@jck.com + + + + + + + + + + + + + + + + + + +Hoffman & Klensin Best Current Practice [Page 47] + |