author | Thomas Voss <mail@thomasvoss.com> | 2024-11-27 20:54:24 +0100 |
---|---|---|
committer | Thomas Voss <mail@thomasvoss.com> | 2024-11-27 20:54:24 +0100 |
commit | 4bfd864f10b68b71482b35c818559068ef8d5797 (patch) | |
tree | e3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc3536.txt | |
parent | ea76e11061bda059ae9f9ad130a9895cc85607db (diff) |
doc: Add RFC documents
Diffstat (limited to 'doc/rfc/rfc3536.txt')
-rw-r--r-- | doc/rfc/rfc3536.txt | 1683 |
1 files changed, 1683 insertions, 0 deletions
diff --git a/doc/rfc/rfc3536.txt b/doc/rfc/rfc3536.txt new file mode 100644 index 0000000..4156440 --- /dev/null +++ b/doc/rfc/rfc3536.txt @@ -0,0 +1,1683 @@ + + + + + + +Network Working Group P. Hoffman +Request for Comments: 3536 IMC & VPNC +Category: Informational May 2003 + + + Terminology Used in Internationalization in the IETF + +Status of this Memo + + This memo provides information for the Internet community. It does + not specify an Internet standard of any kind. Distribution of this + memo is unlimited. + +Copyright Notice + + Copyright (C) The Internet Society (2003). All Rights Reserved. + +Abstract + + This document provides a glossary of terms used in the IETF when + discussing internationalization. The purpose is to help frame + discussions of internationalization in the various areas of the IETF + and to help introduce the main concepts to IETF participants. + +Table of Contents + + 1. Introduction................................................... 2 + 1.1 Purpose of this document.................................... 2 + 1.2 Format of the definitions in this document.................. 3 + 2. Fundamental Terms.............................................. 3 + 3. Standards Bodies and Standards................................. 8 + 3.1 Standards bodies............................................ 8 + 3.2 Encodings and transformation formats of ISO/IEC 10646....... 10 + 3.3 Native CCSs and charsets.................................... 11 + 4. Character Issues............................................... 12 + 4.1 Types of characters......................................... 15 + 5. User interface for text........................................ 17 + 6. Text in current IETF protocols................................. 19 + 7. Other Common Terms In Internationalization..................... 22 + 8. Security Considerations........................................ 25 + 9. References..................................................... 
25 + 9.1 Normative References........................................ 25 + 9.2 Informative References...................................... 26 + 10. Additional Interesting Reading................................ 27 + 11. Index......................................................... 27 + A. Acknowledgements............................................... 29 + B. Author's Address............................................... 29 + Full Copyright Statement.......................................... 30 + + + +Hoffman Informational [Page 1] + +RFC 3536 Terminology Used in Internationalization in the IETF May 2003 + + +1. Introduction + + As [RFC2277] summarizes: "Internationalization is for humans. This + means that protocols are not subject to internationalization; text + strings are." Many protocols throughout the IETF use text strings + that are entered by, or are visible to, humans. It should be + possible for anyone to enter or read these text strings, which means + that Internet users must be able to be enter text in typical input + methods and displayed in any human language. Further, text + containing any character should be able to be passed between Internet + applications easily. This is the challenge of internationalization. + +1.1 Purpose of this document + + This document provides a glossary of terms used in the IETF when + discussing internationalization. The purpose is to help frame + discussions of internationalization in the various areas of the IETF + and to help introduce the main concepts to IETF participants. + + Internationalization is discussed in many working groups of the IETF. + However, few working groups have internationalization experts. When + designing or updating protocols, the question often comes up "should + we internationalize this" (or, more likely, "do we have to + internationalize this"). 
+ + This document gives an overview of internationalization as it applies + to IETF standards work by lightly covering the many aspects of + internationalization and the vocabulary associated with those topics. + It is not meant to be a complete description of internationalization. + The definitions in this document are not normative for IETF + standards; however, they are useful and standards may make + informative reference to this document after it becomes an RFC. Some + of the definitions in this document come from many earlier IETF + documents and books. + + As in many fields, there is disagreement in the internationalization + community on definitions for many words. The topic of language + brings up particularly passionate opinions for experts and non- + experts alike. This document attempts to define terms in a way that + will be most useful to the IETF audience. + + This document uses definitions from many documents that have been + developed outside the IETF. The primary documents used are: + + - ISO/IEC 10646 [ISOIEC10646] + + - The Unicode Standard [UNICODE] + + + + +Hoffman Informational [Page 2] + +RFC 3536 Terminology Used in Internationalization in the IETF May 2003 + + + - W3C Character Model [CHARMOD] + + - IETF RFCs, including [RFC2277] + +1.2 Format of the definitions in this document + + In the body of this document, the source for the definition is shown + in angle brackets, such as "<ISOIEC10646>". Many definitions are + shown as "<NONE>", which means that the definitions were crafted + originally for this document. The angle bracket notation for the + source of definitions is different than the square bracket notation + used for references to documents, such as in the paragraph above; + these references are given in Section 9. + + For some terms, there are commentary and examples after the + definitions. 
In those cases, the part before the angle brackets is + the definition that comes from the original source, and the part + after the angle brackets is commentary that is not a definition (such + as examples or further exposition). + + Examples in this document use the notation for code points and names + from the Unicode Standard [UNICODE] and ISO/IEC 10646 [ISOIEC10646]. + For example, the letter "a" may be represented as either "U+0061" or + "LATIN SMALL LETTER A". + +2. Fundamental Terms + + This section covers basic topics that are needed for almost anyone + who is involved with making IETF protocols more friendly to non-ASCII + text and with other aspects of internationalization. + + language + + A language is a way that humans interact. The use of language + occurs in many forms, the most common of which are speech, + writing, and signing. <NONE> + + Some languages have a close relationship between the written and + spoken forms, while others have a looser relationship. [RFC3066] + discusses languages in more detail and provides identifiers for + languages for use in Internet protocols. Note that computer + languages are explicitly excluded from this definition. + + script + + A set of graphic characters used for the written form of one or + more languages. <ISOIEC10646> + + + + +Hoffman Informational [Page 3] + +RFC 3536 Terminology Used in Internationalization in the IETF May 2003 + + + Examples of scripts are Latin, Cyrillic, Greek, Arabic, and Han + (the ideographs used in writing Chinese, Japanese, and Korean). + [RFC2277] discusses scripts in detail. + + It is common for internationalization novices to mix up the terms + "language" and "script". This can be a problem in protocols that + differentiate the two. Almost all protocols that are designed (or + were re-designed) to handle non-ASCII text deal with scripts (the + written systems) or characters, while fewer actually deal with + languages. 
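The script/language distinction above can be illustrated with a minimal Python sketch; it is not part of RFC 3536. It assumes that the first word of a character's Unicode name (for example "CYRILLIC" or "LATIN") is a usable stand-in for its script. That is only a heuristic, because the standard library's `unicodedata` module does not expose the Unicode Script property. The helper `script_hints` and the sample words are hypothetical names chosen for illustration.

```python
import unicodedata


def script_hints(text):
    """Return the first word of the Unicode name of each letter in text.

    Heuristic only: the name prefix is used as a rough script label.
    """
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}


russian = "мир"      # a Russian word
bulgarian = "свят"   # a Bulgarian word
english = "peace"    # an English word

russian_hints = script_hints(russian)
bulgarian_hints = script_hints(bulgarian)
english_hints = script_hints(english)

print(russian_hints, bulgarian_hints, english_hints)
```

The Russian and Bulgarian words yield the same script label, so the characters alone identify the script but not the language. A protocol that needs to know the language has to carry that information separately, for example with the language identifiers discussed in [RFC3066].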
+ + A single name can mean either a language or a script; for example, + "Arabic" is both the name of a language and the name of a script. + In fact, many scripts borrow their names from the names of + languages. Further, many scripts are used for many languages; for + example, the Russian and Bulgarian languages are written in the + Cyrillic script. Some languages can be expressed using different + scripts; the Mongolian language can be written in either the + Mongolian and Cyrillic scripts, and the Serbo-Croatian language is + written using both the Latin and Cyrillic scripts. Further, some + languages are normally expressed with more than one script at the + same time; for example, the Japanese language is normally + expressed in the Kanji (Han), Katakana, and Hiragana scripts in a + single string of text. + + character + + A member of a set of elements used for the organization, control, + or representation of data. <ISOIEC10646> + + There are at least three common definitions of the word + "character": + + - a general description of a text entity + + - a unit of a writing system, often synonymous with "letter" or + similar terms + + - the encoded entity itself + + When people talk about characters, they are mostly using one of + the first two definitions. + + A particular character is identified by its name, not by its + shape. A name may suggest a meaning, but the character may be + used for representing other meanings as well. A name may suggest + + + + + +Hoffman Informational [Page 4] + +RFC 3536 Terminology Used in Internationalization in the IETF May 2003 + + + a shape, but that does not imply that only that shape is commonly + used in print, nor that the particular shape is associated only + with that name. + + coded character + + A character together with its coded representation. 
<ISOIEC10646> + + coded character set + + A coded character set (CCS) is a set of unambiguous rules that + establishes a character set and the relationship between the + characters of the set and their coded representation. + <ISOIEC10646> + + character encoding form + + A character encoding form is a mapping from a character set + definition to the actual code units used to represent the data. + <UNICODE> + + repertoire + + The collection of characters included in a character set. Also + called a character repertoire. <UNICODE> + + glyph + + A glyph is an abstract form that represents one or more glyph + images. The term "glyph" is often a synonym for glyph image, + which is the actual, concrete image of a glyph representation + having been rasterized or otherwise imaged onto some display + surface. In displaying character data, one or more glyphs may be + selected to depict a particular character. These glyphs are + selected by a rendering engine during composition and layout + processing. <UNICODE> + + glyph code + + A glyph code is a numeric code that refers to a glyph. Usually, + the glyphs contained in a font are referenced by their glyph code. + Glyph codes are local to a particular font; that is, a different + font containing the same glyphs may use different codes. + <UNICODE> + + + + + + + +Hoffman Informational [Page 5] + +RFC 3536 Terminology Used in Internationalization in the IETF May 2003 + + + transcoding + + Transcoding is the process of converting text data from one + character encoding form to another. Transcoders work only at the + level of character encoding and do not parse the text. Note: + Transcoding may involve one-to-one, many-to-one, one-to-many or + many-to-many mappings. Because some legacy mappings are glyphic, + they may not only be many-to-many, but also discontinuous: thus + XYZ may map to yxz. <CHARMOD> + + In this definition, "many-to-one" means a sequence of characters + mapped to a single character. 
The "many" does not mean + alternative characters that map to the single character. + + character encoding scheme + + A character encoding scheme (CES) is a character encoding form + plus byte serialization. There are many character encoding + schemes in Unicode, such as UTF-8 and UTF-16. <UNICODE> + + Some CESs are associated with a single CCS; for example, UTF-8 + [RFC2279] applies only to ISO/IEC 10646. Other CESs, such as ISO + 2022, are associated with many CCSs. + + charset + + A charset is a method of mapping a sequence of octets to a + sequence of abstract characters. A charset is, in effect, a + combination of one or more CCSs with a CES. Charset names are + registered by the IANA according to procedures documented in + [RFC2278]. <NONE> + + Many protocol definitions use the term "character set" in their + descriptions. The terms "charset" or "character encoding scheme" + are strongly preferred over the term "character set" because + "character set" has other definitions in other contexts and this + can be confusing. + + internationalization + + In the IETF, "internationalization" means to add or improve the + handling of non-ASCII text in a protocol. <NONE> + + Many protocols that handle text only handle one script (often, the + one that contains the letters used in English text), or leave the + question of what character set is used up to local guesswork + + + + + +Hoffman Informational [Page 6] + +RFC 3536 Terminology Used in Internationalization in the IETF May 2003 + + + (which leads, of course, to interoperability problems). Adding + non-ASCII text to such a protocol allows the protocol to handle + more scripts, hopefully all of the ones useful in the world. + + localization + + The process of adapting an internationalized application platform + or application to a specific cultural environment. In + localization, the same semantics are preserved while the syntax + may be changed. 
[FRAMEWORK] + + Localization is the act of tailoring an application for a + different language or script or culture. Some internationalized + applications can handle a wide variety of languages. Typical + users only understand a small number of languages, so the program + must be tailored to interact with users in just the languages they + know. + + The major work of localization is translating the user interface + and documentation. Localization involves not only changing the + language interaction, but also other relevant changes such as + display of numbers, dates, currency, and so on. The better + internationalized an application is, the easier it is to localize + it for a particular language and character encoding scheme. + + Localization is rarely an IETF matter, and protocols that are + merely localized, even if they are serially localized for several + locations, are generally considered unsatisfactory for the global + Internet. + + Do not confuse "localization" with "locale", which is described in + Section 7 of this document. + + i18n, l10n + + These are abbreviations for "internationalization" and + "localization". <NONE> + + "18" is the number of characters between the "i" and the "n" in + "internationalization", and "10" is the number of characters + between the "l" and the "n" in "localization". + + multilingual + + The term "multilingual" has many widely-varying definitions and + thus is not recommended for use in standards. Some of the + definitions relate to the ability to handle international + + + + +Hoffman Informational [Page 7] + +RFC 3536 Terminology Used in Internationalization in the IETF May 2003 + + + characters; other definitions relate to the ability to handle + multiple charsets; and still others relate to the ability to + handle multiple languages. <NONE> + + displaying and rendering text + + To display text, a system puts characters on a visual display + device such as a screen or a printer. 
To render text, a system + analyzes the character input to determine how to display the text. + The terms "display" and "render" are sometimes used + interchangeably. Note, however, that text might be rendered as + audio and/or tactile output, such as in systems that have been + designed for people with visual disabilities. <NONE> + + Combining characters modify the display of the character (or, in + some cases, characters) that precede them. When rendering such + text, the display engine must either find the glyph in the font + that represents the base character and all of the combining + characters, or it must render the combination itself. Such + rendering can be straight-forward, but it is sometimes complicated + when the combining marks interact with each other, such as when + there are two combining marks that would appear above the same + character. Formatting characters can also change the way that a + renderer would display text. Rendering can also be difficult for + some scripts that have complex display rules for base characters, + such as Arabic and Indic scripts. + +3. Standards Bodies and Standards + + This section describes some of the standards bodies and standards + that appear in discussions of internationalization in the IETF. This + is an incomplete and possibly over-full list; listing too few bodies + or standards can be just as politically dangerous as listing too + many. Note that there are many other bodies that deal with + internationalization; however, few if any of them appear commonly in + IETF standards work. + +3.1 Standards bodies + + ISO + + The International Organization for Standardization has been + involved with standards for characters since before the IETF was + started. ISO is a non-governmental group made up of national + bodies. 
ISO has many diverse standards in the international + characters area; the one that is most used in the IETF is commonly + referred to as "ISO/IEC 10646", although its official name has + + + + +Hoffman Informational [Page 8] + +RFC 3536 Terminology Used in Internationalization in the IETF May 2003 + + + more qualifications. (The IEC is International Electrotechnical + Commission). ISO/IEC 10646 describes a CCS that covers almost all + known written characters in use today. + + ISO/IEC 10646 is controlled by the group known as "ISO/IEC JTC + 1/SC 2 WG2", often called "WG2" for short. ISO standards go + through many steps before being finished, and years often go by + between changes to ISO/IEC 10646. Information on WG2, and its + work products, can be found at + <http://www.dkuug.dk/JTC1/SC2/WG2/>. + + The standard, which comes in multiple parts, can be purchased in + both print and CD-ROM versions. One example of how to cite the + standard is given in [RFC2279]. Any standard that cites ISO/IEC + 10646 needs to evaluate how to handle the versioning problem that + is relevant to the protocol's needs. + + ISO is responsible for other standards that might be of interest + to protocol developers. [ISO 639] specifies the names of + languages, and [ISO 3166] specifies the abbreviations of + countries. Character work is done in the group known as ISO/IEC + JTC1/SC22 and ISO TC46, as well as other ISO groups. + + Another relevant ISO group is JTC 1/SC22/WG20, which is + responsible for internationalization in JTC1, such as for + international string ordering. Information on WG20, and its work + products, can be found at <http://www.dkuug.dk/jtc1/sc22/wg20/> + + Unicode Consortium + + The second important group for international character standards + is the Unicode Consortium. The Unicode Consortium is a trade + association of companies, governments, and other groups interested + in promoting the Unicode Standard [UNICODE]. 
The Unicode Standard + is a CCS whose repertoire and code points are identical to ISO/IEC + 10646. The Unicode Consortium has added features to the base CCS + which make it more useful in protocols, such as defining + attributes for each character. Examples of these attributes + include case conversion and numeric properties. + + The Unicode Consortium publishes addenda to the Unicode Standard + as Unicode Technical Reports. There are many types of technical + reports at various stages of maturity. The Unicode Standard and + affiliated technical reports can be found at + <http://www.unicode.org/>. + + + + + + +Hoffman Informational [Page 9] + +RFC 3536 Terminology Used in Internationalization in the IETF May 2003 + + + World Wide Web Consortium (W3C) + + This group created and maintains the standard for XML, the markup + language for text that has become very popular. XML has always + been fully internationalized so that there is no need for a new + version to handle international text. + + local and regional standards organizations + + Just as there are many native CCSs and charsets, there are many + local and regional standards organizations to create and support + them. Common examples of these are ANSI (United States), and + CEN/ISSS (Europe). + +3.2 Encodings and transformation formats of ISO/IEC 10646 + + Characters in the ISO/IEC 10646 CCS can be expressed in many ways. + Encoding forms are direct addressing methods, while transformation + formats are methods for expressing encoding forms as bits on the + wire. + + Basic Multilingual Plane (BMP) + + The BMP is composed of the first 2^16 code points in ISO/IEC + 10646. The BMP is also called "plane 0". + + UCS-2 and UCS-4 + + UCS-2 and UCS-4 are the two encoding forms defined for ISO/IEC + 10646. UCS-2 addresses only the BMP. Because many useful + characters (such as many Han characters) have been defined outside + of the BMP, many people would consider UCS-2 to be dead. 
+ Theoretically, UCS-4 addresses the entire range of 2^31 code + points from ISO/IEC 10646 as 32-bit values. However, for + interoperability with UTF-16, ISO 10646 restricts the range of + characters that will actually be allocated to the values + 0..0x10FFFF. + + UTF-8 + + UTF-8, a transformation format specified in [RFC2279], is the + preferred encoding for IETF protocols. Characters in the BMP are + encoded as one, two, or three octets. Characters outside the BMP + are encoded as four octets. Characters from the US-ASCII + repertoire have the same on-the-wire representation in UTF-8 as + they do in US-ASCII. + + + + + +Hoffman Informational [Page 10] + +RFC 3536 Terminology Used in Internationalization in the IETF May 2003 + + + UTF-16, UTF-16BE, and UTF-16LE + + UTF-16, UTF-16BE, and UTF-16LE, three transformation formats + defined in [RFC2781], are not required by any IETF standards, and + are thus used much less often than UTF-8. Characters in the BMP + are always encoded as two octets, and characters outside the BMP + are encoded as four octets. The three formats differ based on the + order of the octets and the presence of a special lead-in mark + called the "byte order mark" or "BOM". + + UTF-32 + + The Unicode Consortium has defined UTF-32 as a transformation + format for UCS-4 in [UTR19]. + + SCSU and BOCU-1 + + The Unicode Consortium has defined an encoding, SCSU, which is + designed to offer good compression for typical text. SCSU is + described in [UTR6]. A different encoding that is meant to be + MIME-friendly, BOCU-1, is described in [UTN6]. Although + compression is attractive, as opposed to UTF-8 , neither of these + (at the time of this writing) has attracted much interest in the + IETF. + +3.3 Native CCSs and charsets + + Before ISO/IEC 10646 was developed, many countries developed their + own CCSs and charsets. Many dozen of these are in common use on the + Internet today. 
Examples include ISO 8859-5 for Cyrillic and Shift- + JIS for Japanese scripts. + + The official list of the registered charset names for use with IETF + protocols is maintained by IANA and can be found at + <http://www.iana.org/assignments/character-sets>. The list contains + preferred names and aliases. Note that this list has historically + contained many errors, such as names that are in fact not charsets or + references that do not give enough detail to reliably map names to + charsets. + + Probably the most well-known native CCS is ASCII [US-ASCII]. This + CCS is used as the basis for keywords and parameter names in many + IETF protocols, and as the sole CCS in numerous IETF protocols that + have not yet been internationalized. + + [UTR22] describes issues involved in mapping character data between + charsets, and an XML format for mapping table data. + + + + +Hoffman Informational [Page 11] + +RFC 3536 Terminology Used in Internationalization in the IETF May 2003 + + +4. Character Issues + + This section contains terms and topics that are commonly used in + character handling and therefore are of concern to people adding + non-ASCII text handling to protocols. These topics are standardized + outside the IETF. + + combining character + + A member of an identified subset of the coded character set of + ISO/IEC 10646 intended for combination with the preceding non- + combining graphic character, or with a sequence of combining + characters preceded by a non-combining character. <ISOIEC10646> + + composite sequence + + A sequence of graphic characters consisting of a non-combining + character followed by one or more combining characters. A graphic + symbol for a composite sequence generally consists of the + combination of the graphic symbols of each character in the + sequence. A composite sequence is not a character and therefore + is not a member of the repertoire of ISO/IEC 10646. 
<ISOIEC10646> + + In some CCSs, some characters consist of combinations of other + characters. For example, the letter "a with acute" might be a + combination of the two characters "a" and "combining acute", or it + might be a combination of the three characters "a", a non- + destructive backspace, and an acute. The rules for combining two + or more characters are called "composition rules", and the rules + for taking apart a character into other characters is called + "decomposition rules". The results of composition is called a + "precomposed character"; the results of decomposition is called a + "decomposed character". + + normalization + + Normalization is the transformation of data to a normal form, for + example, to unify spelling. <UNICODE> + + Note that the phrase "unify spelling" in the definition above does + not mean unifying different words with the same meaning (such as + "color" and "colour"). Instead, it means unifying different + character sequences that are intended to form the same composite + characters (such as "<a><n><combining tilde><o>" and "<a><n with + tilde><o>"). + + + + + + +Hoffman Informational [Page 12] + +RFC 3536 Terminology Used in Internationalization in the IETF May 2003 + + + The purpose of normalization is to allow two strings to be + compared for equivalence. The strings "<a><n><combining + tilde><o>" and "<a><n with tilde><o>" would be shown identically + on a text display device. If a protocol designer wants those two + strings to be considered equivalent during comparison, the + protocol must define where normalization occurs. + + The terms "normalization" and "canonicalization" are often used + interchangeably. Generally, they both mean to convert a string of + one or more characters into another string based on standardized + rules. 
Some CCSs allow multiple equivalent representations for a + written string; normalization selects one among multiple + equivalent representations as a base for reference purposes in + comparing strings. In strings of text, these rules are usually + based on decomposing combined characters or composing characters + with combining characters. [UTR15] describes the process and many + forms of normalization in detail. Normalization is important when + comparing strings to see if they are the same. + + case + + Case is the feature of certain alphabets where the letters have + two distinct forms. These variants, which may differ markedly in + shape and size, are called the uppercase letter (also known as + capital or majuscule) and the lowercase letter (also known as + small or minuscule). Case mapping is the association of the + uppercase and lowercase forms of a letter. <UNICODE> + + There is usually (but not always) a one-to-one mapping between the + same letter in the two cases. However, there are many examples of + characters which exist in one case but for which there is no + corresponding character in the other case or for which there is a + special mapping rule, such as the Turkish dotless "i" and some + Greek characters with modifiers. Case mapping can even be + dependent on locale. Converting text to have only one case is + called "case folding". + + sorting and collation + + Collating is the process of ordering units of textual information. + Collation is usually specific to a particular language. It is + sometimes known as alphabetizing, although alphabetization is just + a special case of sorting and collation. 
<UNICODE> + + + + + + + + +Hoffman Informational [Page 13] + +RFC 3536 Terminology Used in Internationalization in the IETF May 2003 + + + Collation is concerned with the determination of the relative + order of any particular pair of strings, and algorithms concerned + with collation focus on the problem of providing appropriate + weighted keys for string values, to enable binary comparison of + the key values to determine the relative ordering of the strings. + + Sorting is the process of actually putting data records into + specified orders, according to criteria for comparison between the + records. Sorting can apply to any kind of data (including textual + data) for which an ordering criterion can be defined. Algorithms + concerned with sorting focus on the problem of performance (in + terms of time, memory, or other resources) in actually putting the + data records into a specified order. + + A sorting algorithm for string data can be internationalized by + providing it with the appropriate collation-weighted keys + corresponding to the strings to be ordered. + + Many processes have a need to order strings in a consistent + sequence (sorted). For only a few CCS/CES combinations, there is + an obvious sort order that can be done without reference to the + linguistic meaning of the characters: the codepoint order is + sufficient for sorting. That is, the codepoint order is also the + order that a person would use in sorting the characters. For many + CCS/CES combinations, the codepoint order would make no sense to a + person and therefore is not useful for sorting if the results will + be displayed to a person. + + Codepoint order is usually not how any human educated by a local + school system expects to see strings ordered; if one orders to the + expectations of a human, one has a language-specific sort. 
+ Sorting to codepoint order will seem inconsistent if the strings + are not normalized before sorting because different + representations of the same character will sort differently. This + problem may be smaller with a language-specific sort. + + code table + + A code table is a table showing the characters allocated to the + octets in a code. <ISOIEC10646> + + Code tables are also commonly called "code charts". + + + + + + + + + +Hoffman Informational [Page 14] + +RFC 3536 Terminology Used in Internationalization in the IETF May 2003 + + +4.1 Types of characters + + The following definitions of types of characters do not clearly + delineate each character into one type, nor do they allow someone to + accurately predict what types would apply to a particular character. + The definitions are intended for application designers to help them + think about the many (sometimes confusing) properties of text. + + alphabetic + + An informative Unicode property. Characters that are the primary + units of alphabets and/or syllabaries, whether combining or + noncombining. This includes composite characters that are + canonical equivalents to a combining character sequence of an + alphabetic base character plus one or more combining characters: + letter digraphs; contextual variant of alphabetic characters; + ligatures of alphabetic characters; contextual variants of + ligatures; modifier letters; letterlike symbols that are + compatibility equivalents of single alphabetic letters; and + miscellaneous letter elements. <UNICODE> + + ideographic + + Any symbol that primarily denotes an idea (or meaning) in contrast + to a sound (or pronunciation), for example, a symbol showing a + telephone or the Han characters used in Chinese, Japanese, and + Korean. <UNICODE> + + punctuation + + Characters that separate units of text, such as sentences and + phrases, thus clarifying the meaning of the text. 
The use of + punctuation marks is not limited to prose; they are also used in + mathematical and scientific formulae, for example. <UNICODE> + + symbol + + One of a set of characters other than those used for letters, + digits, or punctuation, and representing various concepts + generally not connected to written language use per se. Examples + include symbols for mathematical operators, symbols for OCR, + symbols for box-drawing or graphics, and symbols for dingbats. + <NONE> + + Examples of symbols include characters for arrows, faces, and + geometric shapes. [UNICODE] has a property that defines + characters as symbols. + + + + +Hoffman Informational [Page 15] + +RFC 3536 Terminology Used in Internationalization in the IETF May 2003 + + + nonspacing character + + A combining character whose positioning in presentation is + dependent on its base character. It generally does not consume + space along the visual baseline in and of itself. <UNICODE> + + A combining acute accent (U+0301) is an example of a nonspacing + character. + + diacritic + + A mark applied or attached to a symbol to create a new symbol that + represents a modified or new value. They can also be marks + applied to a symbol irrespective of whether it changes the value + of that symbol. In the latter case, the diacritic usually + represents an independent value (for example, an accent, tone, or + some other linguistic information). Also called diacritical mark + or diacritical. <UNICODE> + + control character + + The 65 characters in the ranges U+0000..U+001F and U+007F..U+009F. + They are also known as control codes. <UNICODE> + + formatting character + + Characters that are inherently invisible but that have an effect + on the surrounding characters. <UNICODE> + + Examples of formatting characters include characters for + specifying the direction of text and characters that specify how + to join multiple characters. 
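Several of the character types above can be inspected with Python's standard unicodedata module. This is only a sketch: it assumes that the Unicode General_Category values "Cc", "Mn", and "Cf" are a reasonable stand-in for the terms "control character", "nonspacing character", and "formatting character" as used here, which is an approximation rather than something this document states.

```python
import unicodedata


def in_control_ranges(cp: int) -> bool:
    """True if cp lies in U+0000..U+001F or U+007F..U+009F."""
    return 0x0000 <= cp <= 0x001F or 0x007F <= cp <= 0x009F


# Collect every code point whose General_Category is "Cc" (control).
control_code_points = [
    cp for cp in range(0x110000) if unicodedata.category(chr(cp)) == "Cc"
]

# U+0301 COMBINING ACUTE ACCENT is the nonspacing example given above.
acute_category = unicodedata.category("\u0301")

# U+200E LEFT-TO-RIGHT MARK (text direction) and U+200D ZERO WIDTH
# JOINER (joining behavior) are examples of formatting characters.
format_categories = [unicodedata.category(ch) for ch in ("\u200e", "\u200d")]

print(len(control_code_points), acute_category, format_categories)
```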
+ + compatibility character + + A graphic character included as a coded character of ISO/IEC 10646 + primarily for compatibility with existing coded character sets. + <ISOIEC10646> + + For example, U+FF01 (FULLWIDTH EXCLAMATION MARK) was included for + compatibility with Asian character sets that include full-width + and half-width ASCII characters. + + + + + + + + + +Hoffman Informational [Page 16] + +RFC 3536 Terminology Used in Internationalization in the IETF May 2003 + + +5. User interface for text + + Although the IETF does not standardize user interfaces, many + protocols make assumptions about how a user will enter or see text + that is used in the protocol. Internationalization challenges + assumptions about the type and limitations of the input and output + devices that may be used with applications that use various + protocols. It is therefore useful to consider how users typically + interact with text that might contain one or more non-ASCII + characters. + + input methods + + An input method is a mechanism for a person to enter text into an + application. <NONE> + + Text can be entered into a computer in many ways. Keyboards are + by far the most common device used, but many characters cannot be + entered on typical computer keyboards in a single stroke. Many + operating systems come with system software that lets users input + characters outside the range of what is allowed by keyboards. + + For example, there are dozens of different input methods for Han + characters in Chinese, Japanese, and Korean. Some start with + phonetic input through the keyboard, while others use the number + of strokes in the character. Input methods are also needed for + scripts that have many diacritics, such as European characters + that have two or three diacritics on a single alphabetic + character. + + rendering rules + + A rendering rule is an algorithm that a system uses to decide how + to display a string of text. 
<NONE>
+
+      Some scripts can be directly displayed with fonts, where each
+      character from an input stream can simply be copied from a glyph
+      system and put on the screen or printed page.  Other scripts need
+      rules that are based on the context of the characters in order to
+      render text for display.
+
+      Some examples of these rendering rules include:
+
+      -  Scripts such as Arabic (and many others), where the form of
+         the letter changes depending on the adjacent letters, whether
+         the letter is standing alone, at the beginning of a word, in
+         the middle of a word, or at the end of a word.  The rendering
+         rules must choose between two or more glyphs.
+
+      -  Scripts such as the Indic scripts, where consonants may
+         change their form if they are adjacent to certain other
+         consonants or may be displayed in an order different from
+         the way they are stored and pronounced.  The rendering rules
+         must choose between two or more glyphs.
+
+      -  Arabic and Hebrew scripts, where the order in which the
+         characters are displayed is changed by the bidirectional
+         properties of the alphabetic characters and by right-to-left
+         and left-to-right ordering marks.  The rendering rules must
+         choose the order in which characters are displayed.
+
+   graphic symbol
+
+      A graphic symbol is the visual representation of a graphic
+      character or of a composite sequence.  <ISOIEC10646>
+
+   font
+
+      A font is a collection of glyphs used for the visual depiction of
+      character data.  A font is often associated with a set of
+      parameters (for example, size, posture, weight, and serifness),
+      which, when set to particular values, generate a collection of
+      imagable glyphs.  <UNICODE>
+
+   bidirectional display
+
+      The process or result of mixing left-to-right oriented text and
+      right-to-left oriented text in a single line is called
+      bidirectional display.
<UNICODE> + + Most of the world's written languages are displayed left-to-right. + However, many widely-used written languages such as ones based on + the Hebrew or Arabic scripts are displayed right-to-left. Right- + to-left text often confuses protocol writers because they have to + keep thinking in terms of the order of characters in a string in + memory, and that order might be different than what they see on + the screen. (Note that some languages are written both + horizontally and vertically.) + + Further, bidirectional text can cause confusion because there are + formatting characters in ISO/IEC 10646 which cause the order of + display of text to change. These explicit formatting characters + change the display regardless of the implicit left-to-right or + right-to-left properties of characters. + + + + + + +Hoffman Informational [Page 18] + +RFC 3536 Terminology Used in Internationalization in the IETF May 2003 + + + It is common to see strings with text in both directions, such as + strings that include both text and numbers, or strings that + contain a mixture of scripts. + + [UNICODE] has a long and incredibly detailed algorithm for + displaying bidirectional text. + + undisplayable character + + A character that has no displayable form. <NONE> + + For instance, the zero-width space (U+200B) cannot be displayed + because it takes up no horizontal space. Formatting characters + such as those for setting the direction of text are also + undisplayable. Note, however, that every character in [UNICODE] + has a glyph associated with it, and that the glyphs for + undisplayable characters are enclosed in a dashed square as an + indication that the actual character is undisplayable. + +6. Text in current IETF protocols + + Many IETF protocols started off being fully internationalized, while + others have been internationalized as they were revised. In this + process, IETF members have seen patterns in the way that many + protocols use text. 
This section describes some specific protocol
+   interactions with text.
+
+   protocol elements
+
+      Protocol elements are uniquely-named parts of a protocol.  <NONE>
+
+      Almost every protocol has named elements, such as "source port" in
+      TCP.  In some protocols, the names of the elements (or text tokens
+      for the names) are transmitted within the protocol.  For example,
+      in SMTP and numerous other IETF protocols, the names of the verbs
+      are part of the command stream.  The names are thus part of the
+      protocol standard.  The names of protocol elements are not
+      normally seen by end users.
+
+   name spaces
+
+      A name space is the set of valid names for a particular item, or
+      the syntactic rules for generating these valid names.  <NONE>
+
+      Many items in Internet protocols use names to identify specific
+      instances or values.  The names may be generated (by some
+      prescribed rules), registered centrally (such as with IANA), or
+      governed by a distributed registration and control mechanism,
+      such as the names in the DNS.
+
+   on-the-wire encoding
+
+      The encoding and decoding used before and after transmission over
+      the network is often called the "on-the-wire" (or sometimes just
+      "wire") format.  <NONE>
+
+      Characters are identified by codepoints.  Before being transmitted
+      in a protocol, they must first be encoded as bits and octets.
+      Similarly, when characters are received in a transmission, they
+      have been encoded, and a protocol that needs to process the
+      individual characters needs to decode them before processing.
+
+   parsed text
+
+      A text string that is analyzed for subparts.  <NONE>
+
+      In some protocols, free text in text fields might be parsed.  For
+      example, many mail user agents will parse the words in the text of
+      the Subject: field to attempt to thread based on what appears
+      after the "Re:" prefix.
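The on-the-wire encoding and parsed text entries above can be illustrated together with a small sketch. It assumes UTF-8 as the wire encoding, which is only one possibility (each protocol defines its own), and it ignores the header encoding rules that real Internet mail applies to non-ASCII Subject: fields. The function names and the reply-prefix pattern are hypothetical.

```python
import re


def encode_for_wire(text: str) -> bytes:
    """Turn a string of characters into octets (UTF-8 is assumed here)."""
    return text.encode("utf-8")


def decode_from_wire(octets: bytes) -> str:
    """Recover the characters from received octets before processing."""
    return octets.decode("utf-8")


# Hypothetical pattern: one or more leading "Re:" prefixes, any case.
_REPLY_PREFIX = re.compile(r"^(?:re:\s*)+", re.IGNORECASE)


def thread_key(subject: str) -> str:
    """Parse a subject line and return what follows the "Re:" prefix."""
    return _REPLY_PREFIX.sub("", subject.strip())


subject = "Re: re: Caf\u00e9 menu"
wire = encode_for_wire(subject)

# 17 characters become 18 octets because U+00E9 needs two octets in UTF-8.
print(len(subject), len(wire))
print(thread_key(decode_from_wire(wire)))
```

Decoding happens before parsing: the prefix match operates on characters, not on the octets that were transmitted.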
+
+   charset identification
+
+      Specification of the charset used for a string of text.  <NONE>
+
+      Protocols that allow more than one charset to be used in the same
+      place should require that the text be identified with the
+      appropriate charset.  Without this identification, a program
+      looking at the text cannot definitively discern the charset of the
+      text.  Charset identification is also called "charset tagging".
+
+   language identification
+
+      Specification of the human language used for a string of text.
+      <NONE>
+
+      Some protocols (such as MIME and HTTP) allow text that is meant
+      for machine processing to be identified with the language used in
+      the text.  Such identification is important for machine processing
+      of the text, such as by systems that render the text by speaking
+      it.  Language identification is also called "language tagging".
+
+   MIME
+
+      MIME (Multipurpose Internet Mail Extensions) is a message format
+      that allows for textual message bodies and headers in character
+      sets other than US-ASCII in formats that require ASCII (most
+      notably, [RFC2822], the standard for Internet mail headers).  MIME
+      is described in RFCs 2045 through 2049, as well as more recent
+      RFCs.  <NONE>
+
+   transfer encoding syntax
+
+      A transfer encoding syntax (TES) (sometimes called a transfer
+      encoding scheme) is a reversible transform of already-encoded data
+      that is represented in one or more character encoding schemes.
+      <NONE>
+
+      TESs are useful for encoding types of character data into
+      another format, usually for allowing new types of data to be
+      transmitted over legacy protocols.  The main examples of TESs used
+      in the IETF are Base64 and quoted-printable.
+
+   Base64
+
+      Base64 is a transfer encoding syntax that allows binary data to be
+      represented by the ASCII characters A through Z, a through z, 0
+      through 9, +, /, and =.
It is defined in [RFC2045].  <NONE>
+
+   quoted printable
+
+      Quoted printable is a transfer encoding syntax that allows strings
+      that have non-ASCII characters mixed in with mostly ASCII
+      printable characters to be somewhat human readable.  It is
+      defined in [RFC2045]; the closely related "Q" encoding for
+      message headers is described in [RFC2047].  <NONE>
+
+      The quoted printable syntax is generally considered to be a
+      failure at being readable.  It is jokingly referred to as "quoted
+      unreadable".
+
+   XML
+
+      XML (which is an approximate abbreviation for Extensible Markup
+      Language) is a popular method for structuring text.  XML text is
+      explicitly tagged with charsets.  The specification for XML can be
+      found at <http://www.w3.org/XML/>.  <NONE>
+
+   ASN.1 text formats
+
+      The ASN.1 data description language has many formats for text
+      data.  The formats allow for different repertoires and different
+      encodings.  Some of the formats that appear in IETF standards
+      based on ASN.1 include IA5String (all ASCII characters),
+      PrintableString (most ASCII characters, but missing many
+      punctuation characters), BMPString (characters from ISO/IEC 10646
+      plane 0 in UTF-16BE format), UTF8String (just as the name
+      implies), and TeletexString (also called T61String; the repertoire
+      changes over time).
+
+   ASCII-compatible encoding (ACE)
+
+      Starting in 1996, many ASCII-compatible encoding schemes (which
+      are actually transfer encoding syntaxes) have been proposed as
+      possible solutions for internationalizing host names.  Their goal
+      is to be able to encode any string of ISO/IEC 10646 characters as
+      a legal DNS host name (as described in STD 13).  At the time of
+      this writing, no ACE has become an IETF standard.
+
+7. Other Common Terms In Internationalization
+
+   This is a hodge-podge of other terms that have appeared in
+   internationalization discussions in the IETF.
It is likely that + additional terms will be added as this document matures. + + locale + + Locale is the user-specific location and cultural information + managed by a computer. <NONE> + + Because languages differ from country to country (and even region + to region within a country), the locale of the user can often be + an important factor. Typically, the locale information for a user + includes the language(s) used. + + Locale issues go beyond character use, and can include things such + as the display format for currency, dates, and times. Some + locales (especially the popular "C" and "POSIX" locales) do not + include language information. + + It should be noted that there are many thorny, unsolved issues + with locale. For example, should text be viewed using the locale + information of the person who wrote the text or the person viewing + it? What if the person viewing it is travelling to different + locations? Should only some of the locale information affect + creation and editing of text? + + + +Hoffman Informational [Page 22] + +RFC 3536 Terminology Used in Internationalization in the IETF May 2003 + + + Latin characters + + "Latin characters" is a not-precise term for characters + historically related to ancient Greek script and currently used + throughout the world. <NONE> + + The base Latin characters make up the ASCII repertoire and have + been augmented by many single and multiple diacritics and quite a + few other characters. ISO/IEC 10646 encodes the Latin characters + in the ranges U+0020..U+024F, U+1E00..U+1EFF, and other ranges. + + romanization + + The transliteration of a non-Latin script into Latin characters. + <NONE> + + Because of the widespread use of Latin characters, people have + tried to represent many languages that are not based on a Latin + repertoire in Latin. For example, there are two popular + romanizations of Chinese: Wade-Giles and Pinyin, the latter of + which is by far more common today. 
Many romanization systems are + inexact and do not give perfect round trip mappings between the + native script and the Latin characters. + + CJK characters and Han characters + + The ideographic characters used in Chinese, Japanese, Korean, and + traditional Vietnamese writing systems are often called 'CJK + characters' after the initial letters of the language names in + English. They are also called "Han characters", after the term in + Chinese that is often used for these characters. <NONE> + + Note that CJK and Han characters do not include the phonetic + characters used in the Japanese and Korean languages. + + In ISO/IEC 10646, the Han characters were "unified", meaning that + each set of Han characters from Japanese, Chinese, and/or Korean + that had the same origin was assigned a single code point. The + positive result of this was that many fewer code points were + needed to represent Han; the negative result of this was that + characters that people who write the three languages think are + different have the same code point. There is a great deal of + disagreement on the nature, the origin, and the severity of the + problems caused by Han unification. + + + + + + + +Hoffman Informational [Page 23] + +RFC 3536 Terminology Used in Internationalization in the IETF May 2003 + + + translation + + The process of conveying the meaning of some passage of text in + one language, so that it can be expressed equivalently in another + language. <NONE> + + Many language translation systems are inexact and cannot be + applied repeatedly to go from one language to another to another. + + transliteration + + The process of representing the characters of an alphabetical or + syllabic system of writing by the characters of a conversion + alphabet. <NONE> + + Many script transliterations are exact, and many have perfect + round-trip mappings. The notable exception to this is + romanization, described above. 
Transliteration involves + converting text expressed in one script into another script, + generally on a letter-by-letter basis. + + transcription + + The process of systematically writing the sounds of some passage + of spoken language, generally with the use of a technical phonetic + alphabet (usually Latin-based) or other systematic transcriptional + orthography. Transcription also sometimes refers to the + conversion of written text into a transcribed (usually Latin- + based) form, based on the sound of the text as if it had been + spoken. <NONE> + + Unlike transliterations, which are generally designed to be + round-trip convertible, transcriptions of written material are + almost never round-trip convertible to their original form. + + regular expressions + + Regular expressions provide a mechanism to select specific strings + from a set of character strings. Regular expressions are a + language used to search for text within strings, and possibly + modify the text found with other text. <NONE> + + Pattern matching for text involves being able to represent one or + more code points in an abstract notation, such as searching for + all capital Latin letters or all punctuation. The most common + mechanism in IETF protocols for naming such patterns is the use of + regular expressions. There is no single regular expression + language, but there are numerous very similar dialects. + + + +Hoffman Informational [Page 24] + +RFC 3536 Terminology Used in Internationalization in the IETF May 2003 + + + The Unicode Consortium has a good discussion about how to adapt + regular expression engines to use Unicode. [UTR18] + + private use + + ISO/IEC 10646 code points from U+E000 to U+F8FF, U+F0000 to + U+FFFFD, and U+100000 to U+10FFFD are available for private use. + This refers to code points of the standard whose interpretation is + not specified by the standard and whose use may be determined by + private agreement among cooperating users. 
<UNICODE>
+
+      The use of these "private use" characters is defined by the
+      parties who transmit and receive them, and is thus not appropriate
+      for standardization.  (The IETF has a long history of private use
+      names for things such as "x-" names in MIME types, charsets, and
+      languages.  The experience with these has been quite negative,
+      with many implementors assuming that private use names are in fact
+      public and long-lived.)
+
+8. Security Considerations
+
+   Security is not discussed in this document.
+
+9. References
+
+9.1 Normative References
+
+   [ISOIEC10646] ISO/IEC 10646-1:2000.  International Standard --
+                 Information technology -- Universal Multiple-Octet
+                 Coded Character Set (UCS) -- Part 1: Architecture and
+                 Basic Multilingual Plane, 2000.
+
+   [UNICODE]     The Unicode Standard, Version 3.2.0 is defined by The
+                 Unicode Standard, Version 3.0 (Reading, MA, Addison-
+                 Wesley, 2000.  ISBN 0-201-61633-5), as amended by the
+                 Unicode Standard Annex #27: Unicode 3.1
+                 (http://www.unicode.org/reports/tr27/) and by the
+                 Unicode Standard Annex #28: Unicode 3.2
+                 (http://www.unicode.org/reports/tr28/), The Unicode
+                 Consortium, 2002.
+
+9.2 Informative References
+
+   [CHARMOD]   Character Model for the World Wide Web 1.0, W3C,
+               <http://www.w3.org/TR/charmod/>.
+
+   [FRAMEWORK] ISO/IEC TR 11017:1997(E).  Information technology -
+               Framework for internationalization, prepared by ISO/IEC
+               JTC 1/SC 22/WG 20, 1997.
+
+   [ISO 639]   ISO 639:2000 (E/F) - Code for the representation of
+               names of languages, 2000.
+
+   [ISO 3166]  ISO 3166:1988 (E/F) - Codes for the representation of
+               names of countries, 2000.
+
+   [RFC2045]   Freed, N. and N. Borenstein, "MIME Part One: Format of
+               Internet Message Bodies", RFC 2045, November 1996.
+
+   [RFC2047]   Moore, K., "MIME Part Three: Message Header Extensions
+               for Non-ASCII Text", RFC 2047, November 1996.
+ + [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and + Languages", BCP 18, RFC 2277, January 1998. + + [RFC2279] Yergeau, F., "UTF-8, a transformation format of ISO + 10646", RFC 2279, January 1998. + + [RFC2781] Hoffman, P. and F. Yergeau, "UTF-16, an encoding of ISO + 10646", RFC 2781, February 2000. + + [RFC2822] Resnick, P., "Internet Message Format", RFC 2822, April + 2001. + + [RFC3066] Alvestrand, H., "Tags for the Identification of + Languages", BCP 47, RFC 3066, January 2001. + + [US-ASCII] Coded Character Set -- 7-bit American Standard Code for + Information Interchange, ANSI X3.4-1986, 1986. + + [UTN6] "BOCU-1: MIME-Compatible Unicode Compression", M. + Scherer & M. Davis, Unicode Technical Note #6. + + [UTR6] "A Standard Compression Scheme for Unicode", M. Wolf, + et. al., Unicode Technical Report #6. + + [UTR15] "Unicode Normalization Forms", M. Davis & M. Duerst, + Unicode Technical Report #15. + + + + +Hoffman Informational [Page 26] + +RFC 3536 Terminology Used in Internationalization in the IETF May 2003 + + + [UTR18] "Unicode Regular Expression Guidelines", M. Davis, + Unicode Technical Report #18. + + [UTR19] "UTF-32", M. Davis, Unicode Technical Report #19. + + [UTR22] "Character Mapping Markup Language", M. Davis, Unicode + Technical Report #22. + +10. Additional Interesting Reading + + ALA-LC Romanization Tables, Randall Barry (ed.), U.S. Library of + Congress, 1997, ISBN 0844409405 + + Blackwell Encyclopedia of Writing Systems, Florian Coulmas, Blackwell + Publishers, 1999, ISBN 063121481X + + The World's Writing Systems, Peter Daniels and William Bright, Oxford + University Press, 1996, ISBN 0195079930 + + Writing Systems of the World, Akira Nakanishi, Charles E. Tuttle + Company, 1980, ISBN 0804816549 + +11. 
Index + + alphabetic -- 4.1 + ASCII-compatible encoding (ACE) -- 6 + ASN.1 text formats -- 6 + Base64 -- 6 + Basic Multilingual Plane (BMP) -- 3.2 + bidirectional display -- 5 + BOCU-1 -- 3.2 + case -- 4 + character -- 2 + character encoding form -- 2 + character encoding scheme -- 2 + charset -- 2 + charset identification -- 6 + CJK characters and Han characters -- 7 + code chart -- 4 + code table -- 4 + coded character -- 2 + coded character set -- 2 + combining character -- 4 + compatibility character -- 4.1 + composite sequence -- 4 + control character -- 4.1 + diacritic -- 4.1 + displaying and rendering text -- 2 + + + +Hoffman Informational [Page 27] + +RFC 3536 Terminology Used in Internationalization in the IETF May 2003 + + + font -- 5 + formatting character -- 4.1 + glyph -- 2 + glyph code -- 2 + graphic symbol -- 5 + i18n, l10n -- 2 + ideographic -- 4.1 + input methods -- 5 + internationalization -- 2 + ISO -- 3.1 + language -- 2 + language identification -- 6 + Latin characters -- 7 + local and regional standards organizations -- 3.1 + locale -- 7 + localization -- 2 + MIME -- 6 + multilingual -- 2 + name spaces -- 6 + nonspacing character -- 4.1 + normalization -- 4 + on-the-wire encoding -- 6 + parsed text -- 6 + private use -- 7 + protocol elements -- 6 + punctuation -- 4.1 + quoted printable -- 6 + regular expressions -- 7 + rendering rules -- 5 + romanization -- 7 + script -- 2 + SCSU -- 3.2 + sorting and collation -- 4 + symbol -- 4.1 + transcoding -- 2 + transcription -- 7 + transfer encoding syntax -- 6 + translation -- 7 + transliteration -- 7 + UCS-2 and UCS-4 -- 3.2 + undisplayable character -- 5 + Unicode Consortium -- 3.1 + UTF-32 -- 3.2 + UTF-16, UTF-16BE, and UTF-16LE -- 3.2 + UTF-8 -- 3.2 + World Wide Web Consortium -- 3.1 + XML -- 6 + + + + +Hoffman Informational [Page 28] + +RFC 3536 Terminology Used in Internationalization in the IETF May 2003 + + +A. 
Acknowledgements + + The definitions in this document come from many sources, including a + wide variety of IETF documents. + + James Seng contributed to the initial outline of this document. + Harald Alvestrand and Martin Duerst made extensive useful comments on + early versions. Others who contributed to the development include: + + Dan Kohn + Jacob Palme + Johan van Wingen + Peter Constable + Yuri Demchenko + Susan Harris + Zita Wenzel + John Klensin + Henning Schulzrinne + Leslie Daigle + Markus Scherer + Ken Whistler + +B. Author's Address + + Paul Hoffman + Internet Mail Consortium and VPN Consortium + 127 Segre Place + Santa Cruz, CA 95060 USA + + EMail: paul.hoffman@imc.org and paul.hoffman@vpnc.org + + + + + + + + + + + + + + + + + + + + + +Hoffman Informational [Page 29] + +RFC 3536 Terminology Used in Internationalization in the IETF May 2003 + + +Full Copyright Statement + + Copyright (C) The Internet Society (2003). All Rights Reserved. + + This document and translations of it may be copied and furnished to + others, and derivative works that comment on or otherwise explain it + or assist in its implementation may be prepared, copied, published + and distributed, in whole or in part, without restriction of any + kind, provided that the above copyright notice and this paragraph are + included on all such copies and derivative works. However, this + document itself may not be modified in any way, such as by removing + the copyright notice or references to the Internet Society or other + Internet organizations, except as needed for the purpose of + developing Internet standards in which case the procedures for + copyrights defined in the Internet Standards process must be + followed, or as required to translate it into languages other than + English. + + The limited permissions granted above are perpetual and will not be + revoked by the Internet Society or its successors or assigns. 
+ + This document and the information contained herein is provided on an + "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING + TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING + BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION + HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF + MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. + +Acknowledgement + + Funding for the RFC Editor function is currently provided by the + Internet Society. + + + + + + + + + + + + + + + + + + + +Hoffman Informational [Page 30] + |