author Thomas Voss <mail@thomasvoss.com> 2024-11-27 20:54:24 +0100
committer Thomas Voss <mail@thomasvoss.com> 2024-11-27 20:54:24 +0100
commit 4bfd864f10b68b71482b35c818559068ef8d5797
tree e3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc3536.txt
parent ea76e11061bda059ae9f9ad130a9895cc85607db
doc: Add RFC documents
Diffstat (limited to 'doc/rfc/rfc3536.txt')
-rw-r--r-- doc/rfc/rfc3536.txt 1683
1 files changed, 1683 insertions, 0 deletions
diff --git a/doc/rfc/rfc3536.txt b/doc/rfc/rfc3536.txt
new file mode 100644
index 0000000..4156440
--- /dev/null
+++ b/doc/rfc/rfc3536.txt
@@ -0,0 +1,1683 @@
+
+
+
+
+
+
+Network Working Group                                         P. Hoffman
+Request for Comments: 3536                                    IMC & VPNC
+Category: Informational                                         May 2003
+
+
+ Terminology Used in Internationalization in the IETF
+
+Status of this Memo
+
+ This memo provides information for the Internet community. It does
+ not specify an Internet standard of any kind. Distribution of this
+ memo is unlimited.
+
+Copyright Notice
+
+ Copyright (C) The Internet Society (2003). All Rights Reserved.
+
+Abstract
+
+ This document provides a glossary of terms used in the IETF when
+ discussing internationalization. The purpose is to help frame
+ discussions of internationalization in the various areas of the IETF
+ and to help introduce the main concepts to IETF participants.
+
+Table of Contents
+
+ 1. Introduction................................................... 2
+ 1.1 Purpose of this document.................................... 2
+ 1.2 Format of the definitions in this document.................. 3
+ 2. Fundamental Terms.............................................. 3
+ 3. Standards Bodies and Standards................................. 8
+ 3.1 Standards bodies............................................ 8
+ 3.2 Encodings and transformation formats of ISO/IEC 10646....... 10
+ 3.3 Native CCSs and charsets.................................... 11
+ 4. Character Issues............................................... 12
+ 4.1 Types of characters......................................... 15
+ 5. User interface for text........................................ 17
+ 6. Text in current IETF protocols................................. 19
+ 7. Other Common Terms In Internationalization..................... 22
+ 8. Security Considerations........................................ 25
+ 9. References..................................................... 25
+ 9.1 Normative References........................................ 25
+ 9.2 Informative References...................................... 26
+ 10. Additional Interesting Reading................................ 27
+ 11. Index......................................................... 27
+ A. Acknowledgements............................................... 29
+ B. Author's Address............................................... 29
+ Full Copyright Statement.......................................... 30
+
+
+
+Hoffman Informational [Page 1]
+
+RFC 3536 Terminology Used in Internationalization in the IETF May 2003
+
+
+1. Introduction
+
+ As [RFC2277] summarizes: "Internationalization is for humans. This
+ means that protocols are not subject to internationalization; text
+ strings are." Many protocols throughout the IETF use text strings
+ that are entered by, or are visible to, humans. It should be
+ possible for anyone to enter or read these text strings, which means
+ that Internet users must be able to enter text using typical input
+ methods and have it displayed in any human language. Further, text
+ containing any character should be able to be passed between Internet
+ applications easily. This is the challenge of internationalization.
+
+1.1 Purpose of this document
+
+ This document provides a glossary of terms used in the IETF when
+ discussing internationalization. The purpose is to help frame
+ discussions of internationalization in the various areas of the IETF
+ and to help introduce the main concepts to IETF participants.
+
+ Internationalization is discussed in many working groups of the IETF.
+ However, few working groups have internationalization experts. When
+ designing or updating protocols, the question often comes up "should
+ we internationalize this" (or, more likely, "do we have to
+ internationalize this").
+
+ This document gives an overview of internationalization as it applies
+ to IETF standards work by lightly covering the many aspects of
+ internationalization and the vocabulary associated with those topics.
+ It is not meant to be a complete description of internationalization.
+ The definitions in this document are not normative for IETF
+ standards; however, they are useful and standards may make
+ informative reference to this document after it becomes an RFC. Some
+ of the definitions in this document come from many earlier IETF
+ documents and books.
+
+ As in many fields, there is disagreement in the internationalization
+ community on definitions for many words. The topic of language
+ brings up particularly passionate opinions for experts and non-
+ experts alike. This document attempts to define terms in a way that
+ will be most useful to the IETF audience.
+
+ This document uses definitions from many documents that have been
+ developed outside the IETF. The primary documents used are:
+
+ - ISO/IEC 10646 [ISOIEC10646]
+
+ - The Unicode Standard [UNICODE]
+
+ - W3C Character Model [CHARMOD]
+
+ - IETF RFCs, including [RFC2277]
+
+1.2 Format of the definitions in this document
+
+ In the body of this document, the source for the definition is shown
+ in angle brackets, such as "<ISOIEC10646>". Many definitions are
+ shown as "<NONE>", which means that the definitions were crafted
+ originally for this document. The angle bracket notation for the
+ source of definitions is different from the square bracket notation
+ used for references to documents, such as in the paragraph above;
+ these references are given in Section 9.
+
+ For some terms, there are commentary and examples after the
+ definitions. In those cases, the part before the angle brackets is
+ the definition that comes from the original source, and the part
+ after the angle brackets is commentary that is not a definition (such
+ as examples or further exposition).
+
+ Examples in this document use the notation for code points and names
+ from the Unicode Standard [UNICODE] and ISO/IEC 10646 [ISOIEC10646].
+ For example, the letter "a" may be represented as either "U+0061" or
+ "LATIN SMALL LETTER A".
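The two notations above can be derived from a character mechanically. The following is a minimal sketch in Python, assuming the standard library's unicodedata module (which reflects whatever version of the Unicode Character Database ships with the interpreter). The helper name describe is hypothetical and used only for illustration.

```python
import unicodedata


def describe(ch: str) -> tuple[str, str]:
    """Return the (code point notation, character name) pair for one character."""
    # "U+" followed by at least four uppercase hexadecimal digits.
    code_point = f"U+{ord(ch):04X}"
    # The formal character name from the Unicode Character Database.
    name = unicodedata.name(ch)
    return code_point, name


code_point, name = describe("a")
print(code_point, name)
```

Either string identifies the same character; the name is the more readable form, while the code point is the more compact one.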
+
+2. Fundamental Terms
+
+ This section covers basic topics that are needed for almost anyone
+ who is involved with making IETF protocols more friendly to non-ASCII
+ text and with other aspects of internationalization.
+
+ language
+
+ A language is a way that humans interact. The use of language
+ occurs in many forms, the most common of which are speech,
+ writing, and signing. <NONE>
+
+ Some languages have a close relationship between the written and
+ spoken forms, while others have a looser relationship. [RFC3066]
+ discusses languages in more detail and provides identifiers for
+ languages for use in Internet protocols. Note that computer
+ languages are explicitly excluded from this definition.
+
+ script
+
+ A set of graphic characters used for the written form of one or
+ more languages. <ISOIEC10646>
+
+ Examples of scripts are Latin, Cyrillic, Greek, Arabic, and Han
+ (the ideographs used in writing Chinese, Japanese, and Korean).
+ [RFC2277] discusses scripts in detail.
+
+ It is common for internationalization novices to mix up the terms
+ "language" and "script". This can be a problem in protocols that
+ differentiate the two. Almost all protocols that are designed (or
+ were re-designed) to handle non-ASCII text deal with scripts (the
+ written systems) or characters, while fewer actually deal with
+ languages.
+
+ A single name can mean either a language or a script; for example,
+ "Arabic" is both the name of a language and the name of a script.
+ In fact, many scripts borrow their names from the names of
+ languages. Further, many scripts are used for many languages; for
+ example, the Russian and Bulgarian languages are written in the
+ Cyrillic script. Some languages can be expressed using different
+ scripts; the Mongolian language can be written in either the
+ Mongolian or the Cyrillic script, and the Serbo-Croatian language is
+ written using both the Latin and Cyrillic scripts. Further, some
+ languages are normally expressed with more than one script at the
+ same time; for example, the Japanese language is normally
+ expressed in the Kanji (Han), Katakana, and Hiragana scripts in a
+ single string of text.
+
+ character
+
+ A member of a set of elements used for the organization, control,
+ or representation of data. <ISOIEC10646>
+
+ There are at least three common definitions of the word
+ "character":
+
+ - a general description of a text entity
+
+ - a unit of a writing system, often synonymous with "letter" or
+ similar terms
+
+ - the encoded entity itself
+
+ When people talk about characters, they are mostly using one of
+ the first two definitions.
+
+ A particular character is identified by its name, not by its
+ shape. A name may suggest a meaning, but the character may be
+ used for representing other meanings as well. A name may suggest
+
+ a shape, but that does not imply that only that shape is commonly
+ used in print, nor that the particular shape is associated only
+ with that name.
+
+ coded character
+
+ A character together with its coded representation. <ISOIEC10646>
+
+ coded character set
+
+ A coded character set (CCS) is a set of unambiguous rules that
+ establishes a character set and the relationship between the
+ characters of the set and their coded representation.
+ <ISOIEC10646>
+
+ character encoding form
+
+ A character encoding form is a mapping from a character set
+ definition to the actual code units used to represent the data.
+ <UNICODE>
+
+ repertoire
+
+ The collection of characters included in a character set. Also
+ called a character repertoire. <UNICODE>
+
+ glyph
+
+ A glyph is an abstract form that represents one or more glyph
+ images. The term "glyph" is often a synonym for glyph image,
+ which is the actual, concrete image of a glyph representation
+ having been rasterized or otherwise imaged onto some display
+ surface. In displaying character data, one or more glyphs may be
+ selected to depict a particular character. These glyphs are
+ selected by a rendering engine during composition and layout
+ processing. <UNICODE>
+
+ glyph code
+
+ A glyph code is a numeric code that refers to a glyph. Usually,
+ the glyphs contained in a font are referenced by their glyph code.
+ Glyph codes are local to a particular font; that is, a different
+ font containing the same glyphs may use different codes.
+ <UNICODE>
+
+ transcoding
+
+ Transcoding is the process of converting text data from one
+ character encoding form to another. Transcoders work only at the
+ level of character encoding and do not parse the text. Note:
+ Transcoding may involve one-to-one, many-to-one, one-to-many or
+ many-to-many mappings. Because some legacy mappings are glyphic,
+ they may not only be many-to-many, but also discontinuous: thus
+ XYZ may map to yxz. <CHARMOD>
+
+ In this definition, "many-to-one" means a sequence of characters
+ mapped to a single character. The "many" does not mean
+ alternative characters that map to the single character.
+
+ character encoding scheme
+
+ A character encoding scheme (CES) is a character encoding form
+ plus byte serialization. There are many character encoding
+ schemes in Unicode, such as UTF-8 and UTF-16. <UNICODE>
+
+ Some CESs are associated with a single CCS; for example, UTF-8
+ [RFC2279] applies only to ISO/IEC 10646. Other CESs, such as ISO
+ 2022, are associated with many CCSs.
+
+ charset
+
+ A charset is a method of mapping a sequence of octets to a
+ sequence of abstract characters. A charset is, in effect, a
+ combination of one or more CCSs with a CES. Charset names are
+ registered by the IANA according to procedures documented in
+ [RFC2278]. <NONE>
+
+ Many protocol definitions use the term "character set" in their
+ descriptions. The terms "charset" or "character encoding scheme"
+ are strongly preferred over the term "character set" because
+ "character set" has other definitions in other contexts and this
+ can be confusing.
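To illustrate why the charset must be known before octets can be interpreted, here is a small sketch in Python. It assumes Python's built-in codecs for the IANA charset names "iso-8859-1" and "iso-8859-5"; the variable names are illustrative only. The same single octet maps to two unrelated abstract characters depending on which charset is applied.

```python
# One octet, as it might arrive on the wire without a charset label.
octets = b"\xe9"

# Interpreted under two different charsets, the octet yields
# two different abstract characters.
as_latin1 = octets.decode("iso-8859-1")    # LATIN SMALL LETTER E WITH ACUTE
as_cyrillic = octets.decode("iso-8859-5")  # CYRILLIC SMALL LETTER SHCHA

print(f"U+{ord(as_latin1):04X}", f"U+{ord(as_cyrillic):04X}")
```

A protocol that leaves the charset to local guesswork gives the receiver no way to choose between these two readings.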
+
+ internationalization
+
+ In the IETF, "internationalization" means to add or improve the
+ handling of non-ASCII text in a protocol. <NONE>
+
+ Many protocols that handle text only handle one script (often, the
+ one that contains the letters used in English text), or leave the
+ question of what character set is used up to local guesswork
+ (which leads, of course, to interoperability problems). Adding
+ non-ASCII text to such a protocol allows the protocol to handle
+ more scripts, hopefully all of the ones useful in the world.
+
+ localization
+
+ The process of adapting an internationalized application platform
+ or application to a specific cultural environment. In
+ localization, the same semantics are preserved while the syntax
+ may be changed. [FRAMEWORK]
+
+ Localization is the act of tailoring an application for a
+ different language or script or culture. Some internationalized
+ applications can handle a wide variety of languages. Typical
+ users only understand a small number of languages, so the program
+ must be tailored to interact with users in just the languages they
+ know.
+
+ The major work of localization is translating the user interface
+ and documentation. Localization involves not only changing the
+ language interaction, but also other relevant changes such as
+ display of numbers, dates, currency, and so on. The better
+ internationalized an application is, the easier it is to localize
+ it for a particular language and character encoding scheme.
+
+ Localization is rarely an IETF matter, and protocols that are
+ merely localized, even if they are serially localized for several
+ locations, are generally considered unsatisfactory for the global
+ Internet.
+
+ Do not confuse "localization" with "locale", which is described in
+ Section 7 of this document.
+
+ i18n, l10n
+
+ These are abbreviations for "internationalization" and
+ "localization". <NONE>
+
+ "18" is the number of characters between the "i" and the "n" in
+ "internationalization", and "10" is the number of characters
+ between the "l" and the "n" in "localization".
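The counting rule behind these abbreviations can be written down directly. This is only an illustrative sketch; the function name numeronym is an assumption and does not come from any standard.

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as first letter + count of inner letters + last letter."""
    if len(word) < 3:
        # Nothing lies between the first and last letter; leave the word alone.
        return word
    return f"{word[0]}{len(word) - 2}{word[-1]}"


print(numeronym("internationalization"))
print(numeronym("localization"))
```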
+
+ multilingual
+
+ The term "multilingual" has many widely-varying definitions and
+ thus is not recommended for use in standards. Some of the
+ definitions relate to the ability to handle international
+ characters; other definitions relate to the ability to handle
+ multiple charsets; and still others relate to the ability to
+ handle multiple languages. <NONE>
+
+ displaying and rendering text
+
+ To display text, a system puts characters on a visual display
+ device such as a screen or a printer. To render text, a system
+ analyzes the character input to determine how to display the text.
+ The terms "display" and "render" are sometimes used
+ interchangeably. Note, however, that text might be rendered as
+ audio and/or tactile output, such as in systems that have been
+ designed for people with visual disabilities. <NONE>
+
+ Combining characters modify the display of the character (or, in
+ some cases, characters) that precedes them. When rendering such
+ text, the display engine must either find the glyph in the font
+ that represents the base character and all of the combining
+ characters, or it must render the combination itself. Such
+ rendering can be straightforward, but it is sometimes complicated
+ when the combining marks interact with each other, such as when
+ there are two combining marks that would appear above the same
+ character. Formatting characters can also change the way that a
+ renderer would display text. Rendering can also be difficult for
+ some scripts that have complex display rules for base characters,
+ such as Arabic and Indic scripts.
+
+3. Standards Bodies and Standards
+
+ This section describes some of the standards bodies and standards
+ that appear in discussions of internationalization in the IETF. This
+ is an incomplete and possibly over-full list; listing too few bodies
+ or standards can be just as politically dangerous as listing too
+ many. Note that there are many other bodies that deal with
+ internationalization; however, few if any of them appear commonly in
+ IETF standards work.
+
+3.1 Standards bodies
+
+ ISO
+
+ The International Organization for Standardization has been
+ involved with standards for characters since before the IETF was
+ started. ISO is a non-governmental group made up of national
+ bodies. ISO has many diverse standards in the international
+ characters area; the one that is most used in the IETF is commonly
+ referred to as "ISO/IEC 10646", although its official name has
+ more qualifications. (The IEC is the International Electrotechnical
+ Commission.) ISO/IEC 10646 describes a CCS that covers almost all
+ known written characters in use today.
+
+ ISO/IEC 10646 is controlled by the group known as "ISO/IEC JTC
+ 1/SC 2 WG2", often called "WG2" for short. ISO standards go
+ through many steps before being finished, and years often go by
+ between changes to ISO/IEC 10646. Information on WG2, and its
+ work products, can be found at
+ <http://www.dkuug.dk/JTC1/SC2/WG2/>.
+
+ The standard, which comes in multiple parts, can be purchased in
+ both print and CD-ROM versions. One example of how to cite the
+ standard is given in [RFC2279]. Any standard that cites ISO/IEC
+ 10646 needs to evaluate how to handle the versioning problem that
+ is relevant to the protocol's needs.
+
+ ISO is responsible for other standards that might be of interest
+ to protocol developers. [ISO 639] specifies the names of
+ languages, and [ISO 3166] specifies the abbreviations of
+ countries. Character work is done in the groups known as ISO/IEC
+ JTC1/SC22 and ISO TC46, as well as other ISO groups.
+
+ Another relevant ISO group is JTC 1/SC22/WG20, which is
+ responsible for internationalization in JTC1, such as for
+ international string ordering. Information on WG20, and its work
+ products, can be found at <http://www.dkuug.dk/jtc1/sc22/wg20/>.
+
+ Unicode Consortium
+
+ The second important group for international character standards
+ is the Unicode Consortium. The Unicode Consortium is a trade
+ association of companies, governments, and other groups interested
+ in promoting the Unicode Standard [UNICODE]. The Unicode Standard
+ is a CCS whose repertoire and code points are identical to ISO/IEC
+ 10646. The Unicode Consortium has added features to the base CCS
+ which make it more useful in protocols, such as defining
+ attributes for each character. Examples of these attributes
+ include case conversion and numeric properties.
+
+ The Unicode Consortium publishes addenda to the Unicode Standard
+ as Unicode Technical Reports. There are many types of technical
+ reports at various stages of maturity. The Unicode Standard and
+ affiliated technical reports can be found at
+ <http://www.unicode.org/>.
+
+ World Wide Web Consortium (W3C)
+
+ This group created and maintains the standard for XML, the markup
+ language for text that has become very popular. XML has always
+ been fully internationalized so that there is no need for a new
+ version to handle international text.
+
+ local and regional standards organizations
+
+ Just as there are many native CCSs and charsets, there are many
+ local and regional standards organizations to create and support
+ them. Common examples of these are ANSI (United States) and
+ CEN/ISSS (Europe).
+
+3.2 Encodings and transformation formats of ISO/IEC 10646
+
+ Characters in the ISO/IEC 10646 CCS can be expressed in many ways.
+ Encoding forms are direct addressing methods, while transformation
+ formats are methods for expressing encoding forms as bits on the
+ wire.
+
+ Basic Multilingual Plane (BMP)
+
+ The BMP is composed of the first 2^16 code points in ISO/IEC
+ 10646. The BMP is also called "plane 0".
+
+ UCS-2 and UCS-4
+
+ UCS-2 and UCS-4 are the two encoding forms defined for ISO/IEC
+ 10646. UCS-2 addresses only the BMP. Because many useful
+ characters (such as many Han characters) have been defined outside
+ of the BMP, many people would consider UCS-2 to be dead.
+ Theoretically, UCS-4 addresses the entire range of 2^31 code
+ points from ISO/IEC 10646 as 32-bit values. However, for
+ interoperability with UTF-16, ISO/IEC 10646 restricts the range of
+ characters that will actually be allocated to the values
+ 0..0x10FFFF.
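The relationship between the BMP and the restricted range 0..0x10FFFF can be sketched as simple arithmetic on code points. The following Python is illustrative only; the names MAX_CODE_POINT, plane, and in_bmp are assumptions introduced for this sketch. It relies on each plane holding 2^16 code points, as stated above for plane 0.

```python
MAX_CODE_POINT = 0x10FFFF  # upper limit of the range that will be allocated


def plane(code_point: int) -> int:
    """Return the plane number of a code point (each plane holds 2^16 values)."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"outside 0..0x10FFFF: {code_point:#x}")
    return code_point >> 16


def in_bmp(code_point: int) -> bool:
    """True if the code point lies in plane 0, the only plane UCS-2 can address."""
    return plane(code_point) == 0


print(in_bmp(0x4E2D))    # a Han character inside the BMP
print(in_bmp(0x20000))   # a Han character outside the BMP
print(plane(MAX_CODE_POINT))
```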
+
+ UTF-8
+
+ UTF-8, a transformation format specified in [RFC2279], is the
+ preferred encoding for IETF protocols. Characters in the BMP are
+ encoded as one, two, or three octets. Characters outside the BMP
+ are encoded as four octets. Characters from the US-ASCII
+ repertoire have the same on-the-wire representation in UTF-8 as
+ they do in US-ASCII.
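The octet counts described above can be checked with any UTF-8 encoder. The sketch below uses Python's built-in "utf-8" codec; the sample characters are chosen for illustration and are not taken from the original text.

```python
samples = {
    "a": 1,            # U+0061, US-ASCII repertoire
    "\u00e9": 2,       # U+00E9, in the BMP
    "\u20ac": 3,       # U+20AC, in the BMP
    "\U00020000": 4,   # U+20000, outside the BMP
}

for ch, expected_octets in samples.items():
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", len(encoded), encoded.hex())

# US-ASCII characters have the same on-the-wire form in both encodings.
same_on_the_wire = "a".encode("utf-8") == "a".encode("ascii")
```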
+
+ UTF-16, UTF-16BE, and UTF-16LE
+
+ UTF-16, UTF-16BE, and UTF-16LE, three transformation formats
+ defined in [RFC2781], are not required by any IETF standards, and
+ are thus used much less often than UTF-8. Characters in the BMP
+ are always encoded as two octets, and characters outside the BMP
+ are encoded as four octets. The three formats differ based on the
+ order of the octets and the presence of a special lead-in mark
+ called the "byte order mark" or "BOM".
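The differences among the three formats can be observed with Python's built-in "utf-16", "utf-16-be", and "utf-16-le" codecs, as in the sketch below. Note one assumption specific to Python: its "utf-16" encoder always writes a BOM and uses the byte order of the machine it runs on, so the order of the octets after the BOM is platform-dependent.

```python
import codecs

bmp_char = "\u20ac"            # U+20AC, inside the BMP
supplementary = "\U00020000"   # U+20000, outside the BMP

big_endian = bmp_char.encode("utf-16-be")      # no BOM, high octet first
little_endian = bmp_char.encode("utf-16-le")   # no BOM, low octet first
with_bom = bmp_char.encode("utf-16")           # BOM, then platform byte order

print(big_endian.hex(), little_endian.hex(), with_bom.hex())
print(len(supplementary.encode("utf-16-be")))  # octets for a non-BMP character

# The BOM tells a reader which of the two octet orders follows.
starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)
```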
+
+ UTF-32
+
+ The Unicode Consortium has defined UTF-32 as a transformation
+ format for UCS-4 in [UTR19].
+
+ SCSU and BOCU-1
+
+ The Unicode Consortium has defined an encoding, SCSU, which is
+ designed to offer good compression for typical text. SCSU is
+ described in [UTR6]. A different encoding that is meant to be
+ MIME-friendly, BOCU-1, is described in [UTN6]. Although the
+ compression these encodings offer is attractive compared with UTF-8,
+ neither of them (at the time of this writing) has attracted much
+ interest in the IETF.
+
+3.3 Native CCSs and charsets
+
+ Before ISO/IEC 10646 was developed, many countries developed their
+ own CCSs and charsets. Many dozens of these are in common use on the
+ Internet today. Examples include ISO 8859-5 for Cyrillic and Shift-
+ JIS for Japanese scripts.
+
+ The official list of the registered charset names for use with IETF
+ protocols is maintained by IANA and can be found at
+ <http://www.iana.org/assignments/character-sets>. The list contains
+ preferred names and aliases. Note that this list has historically
+ contained many errors, such as names that are in fact not charsets or
+ references that do not give enough detail to reliably map names to
+ charsets.
+
+ Probably the most well-known native CCS is ASCII [US-ASCII]. This
+ CCS is used as the basis for keywords and parameter names in many
+ IETF protocols, and as the sole CCS in numerous IETF protocols that
+ have not yet been internationalized.
+
+ [UTR22] describes issues involved in mapping character data between
+ charsets, and an XML format for mapping table data.
+
+4. Character Issues
+
+ This section contains terms and topics that are commonly used in
+ character handling and therefore are of concern to people adding
+ non-ASCII text handling to protocols. These topics are standardized
+ outside the IETF.
+
+ combining character
+
+ A member of an identified subset of the coded character set of
+ ISO/IEC 10646 intended for combination with the preceding non-
+ combining graphic character, or with a sequence of combining
+ characters preceded by a non-combining character. <ISOIEC10646>
+
+ composite sequence
+
+ A sequence of graphic characters consisting of a non-combining
+ character followed by one or more combining characters. A graphic
+ symbol for a composite sequence generally consists of the
+ combination of the graphic symbols of each character in the
+ sequence. A composite sequence is not a character and therefore
+ is not a member of the repertoire of ISO/IEC 10646. <ISOIEC10646>
+
+ In some CCSs, some characters consist of combinations of other
+ characters. For example, the letter "a with acute" might be a
+ combination of the two characters "a" and "combining acute", or it
+ might be a combination of the three characters "a", a non-
+ destructive backspace, and an acute. The rules for combining two
+ or more characters are called "composition rules", and the rules
+ for taking apart a character into other characters are called
+ "decomposition rules". The result of composition is called a
+ "precomposed character"; the result of decomposition is called a
+ "decomposed character".
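As a concrete illustration of a decomposition rule, the sketch below looks up the decomposition that the Unicode Character Database records for "a with acute". It assumes Python's unicodedata module; the helper name decompose_once is hypothetical, and it handles only canonical mappings (compatibility mappings, which carry a "<tag>" prefix, are left untouched).

```python
import unicodedata


def decompose_once(ch: str) -> str:
    """Apply the character's canonical decomposition mapping, if it has one."""
    mapping = unicodedata.decomposition(ch)  # e.g. "0061 0301", or "" if none
    if not mapping or mapping.startswith("<"):
        return ch
    return "".join(chr(int(code, 16)) for code in mapping.split())


precomposed = "\u00e1"                    # LATIN SMALL LETTER A WITH ACUTE
sequence = decompose_once(precomposed)    # base letter + combining acute

print([f"U+{ord(c):04X}" for c in sequence])
# A nonzero combining class marks a combining character.
print([unicodedata.combining(c) for c in sequence])
```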
+
+ normalization
+
+ Normalization is the transformation of data to a normal form, for
+ example, to unify spelling. <UNICODE>
+
+ Note that the phrase "unify spelling" in the definition above does
+ not mean unifying different words with the same meaning (such as
+ "color" and "colour"). Instead, it means unifying different
+ character sequences that are intended to form the same composite
+ characters (such as "<a><n><combining tilde><o>" and "<a><n with
+ tilde><o>").
+
+ The purpose of normalization is to allow two strings to be
+ compared for equivalence. The strings "<a><n><combining
+ tilde><o>" and "<a><n with tilde><o>" would be shown identically
+ on a text display device. If a protocol designer wants those two
+ strings to be considered equivalent during comparison, the
+ protocol must define where normalization occurs.
+
+ The terms "normalization" and "canonicalization" are often used
+ interchangeably. Generally, they both mean to convert a string of
+ one or more characters into another string based on standardized
+ rules. Some CCSs allow multiple equivalent representations for a
+ written string; normalization selects one among multiple
+ equivalent representations as a base for reference purposes in
+ comparing strings. In strings of text, these rules are usually
+ based on decomposing combined characters or composing characters
+ with combining characters. [UTR15] describes the process and many
+ forms of normalization in detail. Normalization is important when
+ comparing strings to see if they are the same.
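The comparison problem described above can be sketched with the two example strings. The code assumes Python's unicodedata.normalize and picks normalization form NFC purely for illustration; which form a protocol uses, and where normalization occurs, is for the protocol to define. The function name equivalent is hypothetical.

```python
import unicodedata


def equivalent(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after converting both to the same normal form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


with_combining = "an\u0303o"   # <a><n><combining tilde><o>
precomposed = "a\u00f1o"       # <a><n with tilde><o>

print(with_combining == precomposed)            # raw code point comparison
print(equivalent(with_combining, precomposed))  # comparison after normalization
```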
+
+ case
+
+ Case is the feature of certain alphabets where the letters have
+ two distinct forms. These variants, which may differ markedly in
+ shape and size, are called the uppercase letter (also known as
+ capital or majuscule) and the lowercase letter (also known as
+ small or minuscule). Case mapping is the association of the
+ uppercase and lowercase forms of a letter. <UNICODE>
+
+ There is usually (but not always) a one-to-one mapping between the
+ same letter in the two cases. However, there are many examples of
+ characters which exist in one case but for which there is no
+ corresponding character in the other case or for which there is a
+ special mapping rule, such as the Turkish dotless "i" and some
+ Greek characters with modifiers. Case mapping can even be
+ dependent on locale. Converting text to have only one case is
+ called "case folding".
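A few of the irregular mappings mentioned above can be observed with Python's built-in string methods, which apply locale-independent Unicode case mappings. This is a sketch only: locale-dependent rules (such as Turkish "I" lowercasing to dotless "i") are not applied by these methods. The sharp s example is an addition for illustration and is not taken from the original text.

```python
dotless_i = "\u0131"   # LATIN SMALL LETTER DOTLESS I
sharp_s = "\u00df"     # LATIN SMALL LETTER SHARP S

# Uppercasing dotless i gives plain "I", but lowercasing that result gives
# dotted "i": the mapping does not round-trip.
round_trip = dotless_i.upper().lower()

# Sharp s maps to two characters in uppercase, so the mapping is not one-to-one.
upper_sharp_s = sharp_s.upper()

# Case folding maps strings to a single case for comparison purposes.
folded_equal = "STRASSE".casefold() == ("stra" + sharp_s + "e").casefold()

print(round_trip == dotless_i, upper_sharp_s, folded_equal)
```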
+
+ sorting and collation
+
+ Collating is the process of ordering units of textual information.
+ Collation is usually specific to a particular language. It is
+ sometimes known as alphabetizing, although alphabetization is just
+ a special case of sorting and collation. <UNICODE>
+
+ Collation is concerned with the determination of the relative
+ order of any particular pair of strings, and algorithms concerned
+ with collation focus on the problem of providing appropriate
+ weighted keys for string values, to enable binary comparison of
+ the key values to determine the relative ordering of the strings.
+
+ Sorting is the process of actually putting data records into
+ specified orders, according to criteria for comparison between the
+ records. Sorting can apply to any kind of data (including textual
+ data) for which an ordering criterion can be defined. Algorithms
+ concerned with sorting focus on the problem of performance (in
+ terms of time, memory, or other resources) in actually putting the
+ data records into a specified order.
+
+ A sorting algorithm for string data can be internationalized by
+ providing it with the appropriate collation-weighted keys
+ corresponding to the strings to be ordered.
+
+ Many processes have a need to order strings in a consistent
+ sequence (sorted). For only a few CCS/CES combinations, there is
+ an obvious sort order that can be done without reference to the
+ linguistic meaning of the characters: the codepoint order is
+ sufficient for sorting. That is, the codepoint order is also the
+ order that a person would use in sorting the characters. For many
+ CCS/CES combinations, the codepoint order would make no sense to a
+ person and therefore is not useful for sorting if the results will
+ be displayed to a person.
+
+ Codepoint order is usually not how any human educated by a local
+ school system expects to see strings ordered; if one orders to the
+ expectations of a human, one has a language-specific sort.
+ Sorting to codepoint order will seem inconsistent if the strings
+ are not normalized before sorting because different
+ representations of the same character will sort differently. This
+ problem may be smaller with a language-specific sort.
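The gap between codepoint order and a human-oriented order, and the idea of sorting on collation keys, can be sketched as follows. The key function crude_key is a deliberately simplistic stand-in invented for this example: it strips combining marks and ignores case, which is not a real language-specific collation and should not be mistaken for one.

```python
import unicodedata


def crude_key(s: str) -> str:
    """Hypothetical, simplistic sort key: drop combining marks, ignore case."""
    decomposed = unicodedata.normalize("NFD", s)
    base_only = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base_only.casefold()


words = ["Zebra", "apple", "\u00c9clair"]

# Codepoint order: uppercase before lowercase, accented letters last.
by_code_point = sorted(words)

# The sorting algorithm is unchanged; only the keys it compares differ.
by_key = sorted(words, key=crude_key)

print(by_code_point)
print(by_key)
```

The second call shows the separation described above: the sort itself is generic, and the ordering policy lives entirely in the keys supplied to it.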
+
+ code table
+
+ A code table is a table showing the characters allocated to the
+ octets in a code. <ISOIEC10646>
+
+ Code tables are also commonly called "code charts".
+
+4.1 Types of characters
+
+ The following definitions of types of characters do not clearly
+ delineate each character into one type, nor do they allow someone to
+ accurately predict what types would apply to a particular character.
+ The definitions are intended for application designers to help them
+ think about the many (sometimes confusing) properties of text.
+
+ alphabetic
+
+ An informative Unicode property. Characters that are the primary
+ units of alphabets and/or syllabaries, whether combining or
+ noncombining. This includes composite characters that are
+ canonical equivalents to a combining character sequence of an
+ alphabetic base character plus one or more combining characters;
+ letter digraphs; contextual variants of alphabetic characters;
+ ligatures of alphabetic characters; contextual variants of
+ ligatures; modifier letters; letterlike symbols that are
+ compatibility equivalents of single alphabetic letters; and
+ miscellaneous letter elements. <UNICODE>
+
+ ideographic
+
+ Any symbol that primarily denotes an idea (or meaning) in contrast
+ to a sound (or pronunciation), for example, a symbol showing a
+ telephone or the Han characters used in Chinese, Japanese, and
+ Korean. <UNICODE>
+
+ punctuation
+
+ Characters that separate units of text, such as sentences and
+ phrases, thus clarifying the meaning of the text. The use of
+ punctuation marks is not limited to prose; they are also used in
+ mathematical and scientific formulae, for example. <UNICODE>
+
+ symbol
+
+ One of a set of characters other than those used for letters,
+ digits, or punctuation, and representing various concepts
+ generally not connected to written language use per se. Examples
+ include symbols for mathematical operators, symbols for OCR,
+ symbols for box-drawing or graphics, and symbols for dingbats.
+ <NONE>
+
+ Examples of symbols include characters for arrows, faces, and
+ geometric shapes. [UNICODE] has a property that defines
+ characters as symbols.
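+
+ One way to inspect that property, sketched here in Python, is to
+ look at the general category, where the symbol categories are Sm,
+ Sc, Sk, and So. The helper name is_symbol and the sample
+ characters are assumptions made for this example.
+
```python
import unicodedata

def is_symbol(ch: str) -> bool:
    """Return True if the general category of ch is Sm, Sc, Sk, or So."""
    return unicodedata.category(ch).startswith("S")

# An arrow, a face, a geometric shape, a math operator, a letter,
# and a punctuation mark.
samples = ["\u2192", "\u263a", "\u25a0", "+", "A", "!"]
flags = {unicodedata.name(ch): is_symbol(ch) for ch in samples}
```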
+
+ nonspacing character
+
+ A combining character whose positioning in presentation is
+ dependent on its base character. It generally does not consume
+ space along the visual baseline in and of itself. <UNICODE>
+
+ A combining acute accent (U+0301) is an example of a nonspacing
+ character.
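+
+ A minimal Python sketch of how an application might recognize
+ such a character, assuming only the standard unicodedata module.
+ The variable names are illustrative.
+
```python
import unicodedata

acute = "\u0301"  # COMBINING ACUTE ACCENT

# Nonspacing marks carry the general category "Mn".
category = unicodedata.category(acute)

# A non-zero canonical combining class marks a combining character.
combining_class = unicodedata.combining(acute)

# Attached to a base character, the mark composes under NFC.
composed = unicodedata.normalize("NFC", "e" + acute)
```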
+
+ diacritic
+
+ A mark applied or attached to a symbol to create a new symbol that
+ represents a modified or new value. They can also be marks
+ applied to a symbol irrespective of whether it changes the value
+ of that symbol. In the latter case, the diacritic usually
+ represents an independent value (for example, an accent, tone, or
+ some other linguistic information). Also called diacritical mark
+ or diacritical. <UNICODE>
+
+ control character
+
+ The 65 characters in the ranges U+0000..U+001F and U+007F..U+009F.
+ They are also known as control codes. <UNICODE>
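+
+ The count of 65 can be checked mechanically. The following Python
+ sketch enumerates both ranges and confirms that each code point
+ has the general category Cc; the variable names are assumptions
+ for this example.
+
```python
import unicodedata

control_ranges = [(0x0000, 0x001F), (0x007F, 0x009F)]

# Expand both inclusive ranges into individual characters.
controls = [
    chr(cp)
    for first, last in control_ranges
    for cp in range(first, last + 1)
]

count = len(controls)  # 32 + 33
all_cc = all(unicodedata.category(ch) == "Cc" for ch in controls)
```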
+
+ formatting character
+
+ Characters that are inherently invisible but that have an effect
+ on the surrounding characters. <UNICODE>
+
+ Examples of formatting characters include characters for
+ specifying the direction of text and characters that specify how
+ to join multiple characters.
+
+ compatibility character
+
+ A graphic character included as a coded character of ISO/IEC 10646
+ primarily for compatibility with existing coded character sets.
+ <ISOIEC10646>
+
+ For example, U+FF01 (FULLWIDTH EXCLAMATION MARK) was included for
+ compatibility with Asian character sets that include full-width
+ and half-width ASCII characters.
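+
+ The relationship between a compatibility character and its
+ ordinary counterpart is recorded in the character database. The
+ Python sketch below, which assumes only the standard unicodedata
+ module, shows one way to look at it.
+
```python
import unicodedata

fullwidth = "\uff01"

name = unicodedata.name(fullwidth)

# The "<wide>" tag marks a compatibility (not canonical) decomposition.
decomposition = unicodedata.decomposition(fullwidth)

# Compatibility normalization folds the character to U+0021.
folded = unicodedata.normalize("NFKC", fullwidth)

# Canonical normalization leaves it untouched.
canonical = unicodedata.normalize("NFC", fullwidth)
```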
+
+5. User interface for text
+
+ Although the IETF does not standardize user interfaces, many
+ protocols make assumptions about how a user will enter or see text
+ that is used in the protocol. Internationalization challenges
+ assumptions about the type and limitations of the input and output
+ devices that may be used with applications that use various
+ protocols. It is therefore useful to consider how users typically
+ interact with text that might contain one or more non-ASCII
+ characters.
+
+ input methods
+
+ An input method is a mechanism for a person to enter text into an
+ application. <NONE>
+
+ Text can be entered into a computer in many ways. Keyboards are
+ by far the most common device used, but many characters cannot be
+ entered on typical computer keyboards in a single stroke. Many
+ operating systems come with system software that lets users input
+ characters outside the range of what is allowed by keyboards.
+
+ For example, there are dozens of different input methods for Han
+ characters in Chinese, Japanese, and Korean. Some start with
+ phonetic input through the keyboard, while others use the number
+ of strokes in the character. Input methods are also needed for
+ scripts that use many diacritics, for example to enter European
+ characters that carry two or three diacritics on a single
+ alphabetic character.
+
+ rendering rules
+
+ A rendering rule is an algorithm that a system uses to decide how
+ to display a string of text. <NONE>
+
+ Some scripts can be directly displayed with fonts, where each
+ character from an input stream can simply be copied from a glyph
+ system and put on the screen or printed page. Other scripts need
+ rules that are based on the context of the characters in order to
+ render text for display.
+
+ Some examples of these rendering rules include:
+
+ - Scripts such as Arabic (and many others), where the form of
+ the letter changes depending on the adjacent letters, whether
+ the letter is standing alone, at the beginning of a word, in
+ the middle of a word, or at the end of a word. The rendering
+ rules must choose between two or more glyphs.
+
+ - Scripts such as the Indic scripts, where consonants may
+ change their form if they are adjacent to certain other
+ consonants or may be displayed in an order different from
+ the way they are stored and pronounced. The rendering rules
+ must choose between two or more glyphs.
+
+ - Arabic and Hebrew scripts, where the order of the characters
+ displayed is changed by the bidirectional properties of the
+ alphabetic characters and by right-to-left and
+ left-to-right ordering marks. The rendering rules must
+ choose the order in which characters are displayed.
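+
+ The first kind of rule can be glimpsed in the character database
+ itself. A real renderer selects glyphs from a font, but [UNICODE]
+ also encodes compatibility "presentation forms" that record the
+ four contextual shapes of an Arabic letter. The Python sketch
+ below lists them for the letter BEH; it is only an illustration
+ of the idea, not a rendering algorithm, and the variable names
+ are assumptions.
+
```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH as it is stored in text

# Compatibility code points for the four contextual shapes of BEH.
PRESENTATION_FORMS = ["\ufe8f", "\ufe90", "\ufe91", "\ufe92"]

shapes = {}
for form in PRESENTATION_FORMS:
    # The decomposition looks like "<initial> 0628": a tag plus the base.
    tag, base = unicodedata.decomposition(form).split()
    if chr(int(base, 16)) == BEH:
        shapes[tag.strip("<>")] = unicodedata.name(form)
```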
+
+ graphic symbol
+
+ A graphic symbol is the visual representation of a graphic
+ character or of a composite sequence. <ISOIEC10646>
+
+ font
+
+ A font is a collection of glyphs used for the visual depiction of
+ character data. A font is often associated with a set of
+ parameters (for example, size, posture, weight, and serifness),
+ which, when set to particular values, generate a collection of
+ imagable glyphs. <UNICODE>
+
+ bidirectional display
+
+ The process or result of mixing left-to-right oriented text and
+ right-to-left oriented text in a single line is called
+ bidirectional display. <UNICODE>
+
+ Most of the world's written languages are displayed left-to-right.
+ However, many widely-used written languages such as ones based on
+ the Hebrew or Arabic scripts are displayed right-to-left. Right-
+ to-left text often confuses protocol writers because they have to
+ keep thinking in terms of the order of characters in a string in
+ memory, and that order might be different than what they see on
+ the screen. (Note that some languages are written both
+ horizontally and vertically.)
+
+ Further, bidirectional text can cause confusion because there are
+ formatting characters in ISO/IEC 10646 which cause the order of
+ display of text to change. These explicit formatting characters
+ change the display regardless of the implicit left-to-right or
+ right-to-left properties of characters.
+
+ It is common to see strings with text in both directions, such as
+ strings that include both text and numbers, or strings that
+ contain a mixture of scripts.
+
+ [UNICODE] has a long and incredibly detailed algorithm for
+ displaying bidirectional text.
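+
+ The implicit and explicit directional properties mentioned above
+ are exposed as a per-character bidirectional class. The Python
+ sketch below prints nothing and reorders nothing; it only looks
+ up the class for a few sample characters chosen for this
+ example.
+
```python
import unicodedata

samples = {
    "Latin A": "A",
    "Hebrew alef": "\u05d0",
    "Arabic alef": "\u0627",
    "digit one": "1",
    "right-to-left mark": "\u200f",
    "right-to-left override": "\u202e",
}

# L = left-to-right, R and AL = right-to-left, EN = European number,
# RLO = an explicit formatting character that overrides direction.
bidi_classes = {
    label: unicodedata.bidirectional(ch) for label, ch in samples.items()
}
```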
+
+ undisplayable character
+
+ A character that has no displayable form. <NONE>
+
+ For instance, the zero-width space (U+200B) cannot be displayed
+ because it takes up no horizontal space. Formatting characters
+ such as those for setting the direction of text are also
+ undisplayable. Note, however, that every character in [UNICODE]
+ has a glyph associated with it, and that the glyphs for
+ undisplayable characters are enclosed in a dashed square as an
+ indication that the actual character is undisplayable.
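+
+ A protocol that compares or measures strings may want to notice
+ such characters. The Python sketch below assumes that filtering
+ on the general category Cf (format) is an acceptable
+ approximation, which is how the Unicode data shipped with
+ current Python versions classifies both U+200B and the
+ directional marks. The helper name is hypothetical.
+
```python
import unicodedata

def strip_format_chars(text: str) -> str:
    """Drop characters whose general category is Cf (format)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

# "ab", ZERO WIDTH SPACE, "cd", RIGHT-TO-LEFT MARK
tagged = "ab\u200bcd\u200f"

stored_length = len(tagged)
visible_text = strip_format_chars(tagged)
```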
+
+6. Text in current IETF protocols
+
+ Many IETF protocols started off being fully internationalized, while
+ others have been internationalized as they were revised. In this
+ process, IETF members have seen patterns in the way that many
+ protocols use text. This section describes some specific protocol
+ interactions with text.
+
+ protocol elements
+
+ Protocol elements are uniquely-named parts of a protocol. <NONE>
+
+ Almost every protocol has named elements, such as "source port" in
+ TCP. In some protocols, the names of the elements (or text tokens
+ for the names) are transmitted within the protocol. For example,
+ in SMTP and numerous other IETF protocols, the names of the verbs
+ are part of the command stream. The names are thus part of the
+ protocol standard. The names of protocol elements are not
+ normally seen by end users.
+
+ name spaces
+
+ A name space is the set of valid names for a particular item, or
+ the syntactic rules for generating these valid names. <NONE>
+
+ Many items in Internet protocols use names to identify specific
+ instances or values. The names may be generated (by some
+ prescribed rules), registered centrally (such as with IANA), or
+ have a distributed registration and control mechanism, such as
+ the names in the DNS.
+
+ on-the-wire encoding
+
+ The encoding and decoding used before and after transmission over
+ the network is often called the "on-the-wire" (or sometimes just
+ "wire") format. <NONE>
+
+ Characters are identified by codepoints. Before being transmitted
+ in a protocol, they must first be encoded as bits and octets.
+ Similarly, when characters are received in a transmission, they
+ have been encoded, and a protocol that needs to process the
+ individual characters needs to decode them before processing.
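+
+ A small Python sketch of that round trip follows. The sample
+ string is invented, and UTF-8 and UTF-16BE are used only as two
+ possible wire formats.
+
```python
text = "na\u00efve"  # five characters, one of them non-ASCII

# Sender: characters become octets before transmission.
wire_utf8 = text.encode("utf-8")
wire_utf16 = text.encode("utf-16-be")

# Receiver: octets must be decoded before characters can be processed.
received = wire_utf8.decode("utf-8")

# The number of octets depends on the encoding, not just the characters.
sizes = (len(text), len(wire_utf8), len(wire_utf16))
```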
+
+ parsed text
+
+ Text strings that are analyzed for subparts. <NONE>
+
+ In some protocols, free text in text fields might be parsed. For
+ example, many mail user agents will parse the words in the text of
+ the Subject: field to attempt to thread based on what appears
+ after the "Re:" prefix.
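+
+ A sketch of such parsing in Python is shown below. The function
+ name thread_key and the pattern are assumptions made for this
+ example, not the behavior of any particular mail user agent.
+ Note that the pattern only knows the ASCII "Re:" prefix, while
+ some localized clients emit other prefixes.
+
```python
import re

# One or more "Re:" prefixes, in any letter case, at the start.
_REPLY_PREFIX = re.compile(r"^\s*(?:re\s*:\s*)+", re.IGNORECASE)

def thread_key(subject: str) -> str:
    """Strip reply prefixes and fold case to get a threading key."""
    return _REPLY_PREFIX.sub("", subject).strip().casefold()

original = thread_key("Lunch?")
reply = thread_key("Re: RE: Lunch?")
```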
+
+ charset identification
+
+ Specification of the charset used for a string of text. <NONE>
+
+ Protocols that allow more than one charset to be used in the same
+ place should require that the text be identified with the
+ appropriate charset. Without this identification, a program
+ looking at the text cannot definitively discern the charset of the
+ text. Charset identification is also called "charset tagging".
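+
+ The ambiguity is easy to reproduce. In the Python sketch below,
+ a single untagged octet is interpreted under three charsets
+ chosen for this example, and the results differ.
+
```python
octets = b"\xe9"  # one octet, no charset tag

as_latin1 = octets.decode("iso-8859-1")    # a Latin letter
as_cyrillic = octets.decode("iso-8859-5")  # a Cyrillic letter

# The same octet on its own is not even well-formed UTF-8.
try:
    octets.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
```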
+
+ language identification
+
+ Specification of the human language used for a string of text.
+ <NONE>
+
+ Some protocols (such as MIME and HTTP) allow text that is meant
+ for machine processing to be identified with the language used in
+ the text. Such identification is important for machine-processing
+ of the text, such as by systems that render the text by speaking
+ it. Language identification is also called "language tagging".
+
+ MIME
+
+ MIME (Multipurpose Internet Mail Extensions) is a message format
+ that allows for textual message bodies and headers in character
+ sets other than US-ASCII in formats that require ASCII (most
+ notably, [RFC2822], the standard for Internet mail headers). MIME
+ is described in RFCs 2045 through 2049, as well as more recent
+ RFCs. <NONE>
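+
+ As an illustration of non-ASCII text in a header that must stay
+ ASCII, the Python sketch below uses the standard email.header
+ module to produce and decode an encoded-word of the kind
+ described in [RFC2047]. The sample text is invented.
+
```python
from email.header import Header, decode_header

# Encode a non-ASCII value so that it can appear in a mail header.
encoded = Header("Z\u00fcrich", charset="utf-8").encode()

# The result carries its own charset tag and uses only ASCII.
is_ascii = all(ord(ch) < 128 for ch in encoded)

# Decoding yields the original octets together with the charset name.
parts = decode_header(encoded)
```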
+
+ transfer encoding syntax
+
+ A transfer encoding syntax (TES) (sometimes called a transfer
+ encoding scheme) is a reversible transform of already-encoded data
+ that is represented in one or more character encoding schemes.
+ <NONE>
+
+ TESs are useful for encoding types of character data into
+ another format, usually for allowing new types of data to be
+ transmitted over legacy protocols. The main examples of TESs used
+ in the IETF include Base64 and quoted-printable.
+
+ Base64
+
+ Base64 is a transfer encoding syntax that allows binary data to be
+ represented by the ASCII characters A through Z, a through z, 0
+ through 9, +, /, and =. It is defined in [RFC2045]. <NONE>
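+
+ A short Python sketch using the standard base64 module follows.
+ The sample text is invented; note that Base64 operates on
+ octets, so the text is first encoded with a character encoding
+ scheme (UTF-8 here).
+
```python
import base64
import string

octets = "\u00e9t\u00e9".encode("utf-8")  # five octets

encoded = base64.b64encode(octets)
decoded = base64.b64decode(encoded)

# Every output octet comes from the 65-character alphabet listed above.
alphabet = (string.ascii_letters + string.digits + "+/=").encode("ascii")
uses_only_alphabet = all(byte in alphabet for byte in encoded)
```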
+
+ quoted printable
+
+ Quoted printable is a transfer encoding syntax that allows strings
+ that have non-ASCII characters mixed in with mostly ASCII
+ printable characters to be somewhat human readable. It is
+ defined in [RFC2045]; the related "Q" encoding for message
+ headers is described in [RFC2047]. <NONE>
+
+ The quoted printable syntax is generally considered to be a
+ failure at being readable. It is jokingly referred to as "quoted
+ unreadable".
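+
+ For comparison with Base64, here is a Python sketch using the
+ standard quopri module. The sample text and the choice of
+ ISO-8859-1 as the character encoding scheme are assumptions for
+ this example.
+
```python
import quopri

octets = "caf\u00e9 au lait".encode("iso-8859-1")

# Only the non-ASCII octet is escaped; the ASCII text stays legible.
encoded = quopri.encodestring(octets)
decoded = quopri.decodestring(encoded)
```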
+
+ XML
+
+ XML (which is an approximate abbreviation for Extensible Markup
+ Language) is a popular method for structuring text. XML text is
+ explicitly tagged with charsets. The specification for XML can be
+ found at <http://www.w3.org/XML/>. <NONE>
+
+ ASN.1 text formats
+
+ The ASN.1 data description language has many formats for text
+ data. The formats allow for different repertoires and different
+ encodings. Some of the formats that appear in IETF standards
+ based on ASN.1 include IA5String (all ASCII characters),
+ PrintableString (most ASCII characters, but missing many
+ punctuation characters), BMPString (characters from ISO/IEC 10646
+ plane 0 in UTF-16BE format), UTF8String (just as the name
+ implies), and TeletexString (also called T61String; the repertoire
+ changes over time).
+
+ ASCII-compatible encoding (ACE)
+
+ Starting in 1996, many ASCII-compatible encoding schemes (which
+ are actually transfer encoding syntaxes) have been proposed as
+ possible solutions for internationalizing host names. Their goal
+ is to be able to encode any string of ISO/IEC 10646 characters as
+ legal DNS host names (as described in STD 13). At the time of
+ this writing, no ACE has become an IETF standard.
+
+7. Other Common Terms In Internationalization
+
+ This is a hodge-podge of other terms that have appeared in
+ internationalization discussions in the IETF. It is likely that
+ additional terms will be added as this document matures.
+
+ locale
+
+ Locale is the user-specific location and cultural information
+ managed by a computer. <NONE>
+
+ Because languages differ from country to country (and even region
+ to region within a country), the locale of the user can often be
+ an important factor. Typically, the locale information for a user
+ includes the language(s) used.
+
+ Locale issues go beyond character use, and can include things such
+ as the display format for currency, dates, and times. Some
+ locales (especially the popular "C" and "POSIX" locales) do not
+ include language information.
+
+ It should be noted that there are many thorny, unsolved issues
+ with locale. For example, should text be viewed using the locale
+ information of the person who wrote the text or the person viewing
+ it? What if the person viewing it is travelling to different
+ locations? Should only some of the locale information affect
+ creation and editing of text?
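+
+ The Python sketch below looks at the "C" locale through the
+ standard locale module. It assumes nothing beyond the "C"
+ locale, which every platform provides, and it changes the
+ process-wide locale as a side effect.
+
```python
import locale

# Select the language-neutral "C" locale for the whole process.
locale.setlocale(locale.LC_ALL, "C")

conventions = locale.localeconv()
decimal_point = conventions["decimal_point"]
thousands_sep = conventions["thousands_sep"]

# Grouping is requested, but the "C" locale defines no digit grouping.
formatted = locale.format_string("%.2f", 1234.5, grouping=True)
```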
+
+ Latin characters
+
+ "Latin characters" is a not-precise term for characters
+ historically related to ancient Greek script and currently used
+ throughout the world. <NONE>
+
+ The base Latin characters make up the ASCII repertoire and have
+ been augmented by many single and multiple diacritics and quite a
+ few other characters. ISO/IEC 10646 encodes the Latin characters
+ in the ranges U+0020..U+024F, U+1E00..U+1EFF, and other ranges.
+
+ romanization
+
+ The transliteration of a non-Latin script into Latin characters.
+ <NONE>
+
+ Because of the widespread use of Latin characters, people have
+ tried to represent many languages that are not based on a Latin
+ repertoire in Latin. For example, there are two popular
+ romanizations of Chinese: Wade-Giles and Pinyin, the latter of
+ which is by far more common today. Many romanization systems are
+ inexact and do not give perfect round trip mappings between the
+ native script and the Latin characters.
+
+ CJK characters and Han characters
+
+ The ideographic characters used in Chinese, Japanese, Korean, and
+ traditional Vietnamese writing systems are often called "CJK
+ characters" after the initial letters of the language names in
+ English. They are also called "Han characters", after the term in
+ Chinese that is often used for these characters. <NONE>
+
+ Note that CJK and Han characters do not include the phonetic
+ characters used in the Japanese and Korean languages.
+
+ In ISO/IEC 10646, the Han characters were "unified", meaning that
+ each set of Han characters from Japanese, Chinese, and/or Korean
+ that had the same origin was assigned a single code point. The
+ positive result of this was that many fewer code points were
+ needed to represent Han; the negative result of this was that
+ characters that people who write the three languages think are
+ different have the same code point. There is a great deal of
+ disagreement on the nature, the origin, and the severity of the
+ problems caused by Han unification.
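+
+ The distinction between Han characters and the phonetic
+ characters can be seen in the character names. The Python sketch
+ below is a rough heuristic based on name prefixes, not a proper
+ script lookup, and it deliberately ignores compatibility
+ ideographs. The function name script_hint is hypothetical.
+
```python
import unicodedata

def script_hint(ch: str) -> str:
    """Classify a character by the prefix of its Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "Han"
    if name.startswith(("HIRAGANA", "KATAKANA")):
        return "kana"
    if name.startswith("HANGUL"):
        return "Hangul"
    return "other"

# Two Han characters, two hiragana letters, two Hangul syllables.
sample = "\u6f22\u5b57\u304b\u306a\ud55c\uae00"
hints = [script_hint(ch) for ch in sample]
```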
+
+ translation
+
+ The process of conveying the meaning of some passage of text in
+ one language, so that it can be expressed equivalently in another
+ language. <NONE>
+
+ Many language translation systems are inexact and cannot be
+ applied repeatedly to go from one language to another to another.
+
+ transliteration
+
+ The process of representing the characters of an alphabetical or
+ syllabic system of writing by the characters of a conversion
+ alphabet. <NONE>
+
+ Many script transliterations are exact, and many have perfect
+ round-trip mappings. The notable exception to this is
+ romanization, described above. Transliteration involves
+ converting text expressed in one script into another script,
+ generally on a letter-by-letter basis.
+
+ transcription
+
+ The process of systematically writing the sounds of some passage
+ of spoken language, generally with the use of a technical phonetic
+ alphabet (usually Latin-based) or other systematic transcriptional
+ orthography. Transcription also sometimes refers to the
+ conversion of written text into a transcribed (usually Latin-
+ based) form, based on the sound of the text as if it had been
+ spoken. <NONE>
+
+ Unlike transliterations, which are generally designed to be
+ round-trip convertible, transcriptions of written material are
+ almost never round-trip convertible to their original form.
+
+ regular expressions
+
+ Regular expressions provide a mechanism to select specific strings
+ from a set of character strings. Regular expressions are a
+ language used to search for text within strings, and possibly
+ to replace the text found with other text. <NONE>
+
+ Pattern matching for text involves being able to represent one or
+ more code points in an abstract notation, such as searching for
+ all capital Latin letters or all punctuation. The most common
+ mechanism in IETF protocols for naming such patterns is the use of
+ regular expressions. There is no single regular expression
+ language, but there are numerous very similar dialects.
+
+ The Unicode Consortium has a good discussion about how to adapt
+ regular expression engines to use Unicode. [UTR18]
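+
+ The Python sketch below shows how one dialect, that of the
+ standard re module, behaves on non-ASCII text. The sample
+ string is invented, and other dialects differ in the details.
+
```python
import re
import unicodedata

text = "\u00c9cole, ZOO, caf\u00e9!"

# An ASCII-only character class misses the capital E WITH ACUTE.
ascii_capitals = re.findall(r"[A-Z]", text)

# The re module has no \p{Lu} class, so filter on the general category.
all_capitals = [ch for ch in text if unicodedata.category(ch) == "Lu"]

# For str patterns, \w follows the Unicode definition of a word
# character, so accented letters stay inside their words.
words = re.findall(r"\w+", text)
```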
+
+ private use
+
+ ISO/IEC 10646 code points from U+E000 to U+F8FF, U+F0000 to
+ U+FFFFD, and U+100000 to U+10FFFD are available for private use.
+ This refers to code points of the standard whose interpretation is
+ not specified by the standard and whose use may be determined by
+ private agreement among cooperating users. <UNICODE>
+
+ The use of these "private use" characters is defined by the
+ parties who transmit and receive them, and is thus not appropriate
+ for standardization. (The IETF has a long history of private use
+ names for things such as "x-" names in MIME types, charsets, and
+ languages. The experience with these has been quite negative,
+ with many implementors assuming that private use names are in fact
+ public and long-lived.)
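+
+ A protocol implementation that wishes to reject or flag private
+ use code points could test for the three ranges given above. The
+ Python sketch below is one way to do so; the names
+ PRIVATE_USE_RANGES and is_private_use are assumptions for this
+ example.
+
```python
import unicodedata

# The three inclusive ranges listed in the definition above.
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
]

def is_private_use(ch: str) -> bool:
    """Return True if ch falls in one of the private use ranges."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in PRIVATE_USE_RANGES)

# Total number of private use code points across the three ranges.
total = sum(last - first + 1 for first, last in PRIVATE_USE_RANGES)

# The character database reports the same code points as category "Co".
category = unicodedata.category("\U000f0000")
```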
+
+8. Security Considerations
+
+ Security is not discussed in this document.
+
+9. References
+
+9.1 Normative References
+
+ [ISOIEC10646] ISO/IEC 10646-1:2000. International Standard --
+ Information technology -- Universal Multiple-Octet
+ Coded Character Set (UCS) -- Part 1: Architecture and
+ Basic Multilingual Plane, 2000.
+
+ [UNICODE] The Unicode Standard, Version 3.2.0 is defined by The
+ Unicode Standard, Version 3.0 (Reading, MA, Addison-
+ Wesley, 2000. ISBN 0-201-61633-5), as amended by the
+ Unicode Standard Annex #27: Unicode 3.1
+ (http://www.unicode.org/reports/tr27/) and by the
+ Unicode Standard Annex #28: Unicode 3.2
+ (http://www.unicode.org/reports/tr28/), The Unicode
+ Consortium, 2002.
+
+9.2 Informative References
+
+ [CHARMOD] Character Model for the World Wide Web 1.0, W3C,
+ <http://www.w3.org/TR/charmod/>.
+
+ [FRAMEWORK] ISO/IEC TR 11017:1997(E). Information technology -
+ Framework for internationalization, prepared by ISO/IEC
+ JTC 1/SC 22/WG 20, 1997.
+
+ [ISO 639] ISO 639:2000 (E/F) - Code for the representation of
+ names of languages, 2000.
+
+ [ISO 3166] ISO 3166:1988 (E/F) - Codes for the representation of
+ names of countries, 2000.
+
+ [RFC2045] Freed, N. and N. Borenstein, "MIME Part One: Format of
+ Internet Message Bodies", RFC 2045, November 1996.
+
+ [RFC2047] Moore, K., "MIME Part Three: Message Header Extensions
+ for Non-ASCII Text", RFC 2047, November 1996.
+
+ [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and
+ Languages", BCP 18, RFC 2277, January 1998.
+
+ [RFC2279] Yergeau, F., "UTF-8, a transformation format of ISO
+ 10646", RFC 2279, January 1998.
+
+ [RFC2781] Hoffman, P. and F. Yergeau, "UTF-16, an encoding of ISO
+ 10646", RFC 2781, February 2000.
+
+ [RFC2822] Resnick, P., "Internet Message Format", RFC 2822, April
+ 2001.
+
+ [RFC3066] Alvestrand, H., "Tags for the Identification of
+ Languages", BCP 47, RFC 3066, January 2001.
+
+ [US-ASCII] Coded Character Set -- 7-bit American Standard Code for
+ Information Interchange, ANSI X3.4-1986, 1986.
+
+ [UTN6] "BOCU-1: MIME-Compatible Unicode Compression", M.
+ Scherer & M. Davis, Unicode Technical Note #6.
+
+ [UTR6] "A Standard Compression Scheme for Unicode", M. Wolf,
+ et al., Unicode Technical Report #6.
+
+ [UTR15] "Unicode Normalization Forms", M. Davis & M. Duerst,
+ Unicode Technical Report #15.
+
+ [UTR18] "Unicode Regular Expression Guidelines", M. Davis,
+ Unicode Technical Report #18.
+
+ [UTR19] "UTF-32", M. Davis, Unicode Technical Report #19.
+
+ [UTR22] "Character Mapping Markup Language", M. Davis, Unicode
+ Technical Report #22.
+
+10. Additional Interesting Reading
+
+ ALA-LC Romanization Tables, Randall Barry (ed.), U.S. Library of
+ Congress, 1997, ISBN 0844409405
+
+ Blackwell Encyclopedia of Writing Systems, Florian Coulmas, Blackwell
+ Publishers, 1999, ISBN 063121481X
+
+ The World's Writing Systems, Peter Daniels and William Bright, Oxford
+ University Press, 1996, ISBN 0195079930
+
+ Writing Systems of the World, Akira Nakanishi, Charles E. Tuttle
+ Company, 1980, ISBN 0804816549
+
+11. Index
+
+ alphabetic -- 4.1
+ ASCII-compatible encoding (ACE) -- 6
+ ASN.1 text formats -- 6
+ Base64 -- 6
+ Basic Multilingual Plane (BMP) -- 3.2
+ bidirectional display -- 5
+ BOCU-1 -- 3.2
+ case -- 4
+ character -- 2
+ character encoding form -- 2
+ character encoding scheme -- 2
+ charset -- 2
+ charset identification -- 6
+ CJK characters and Han characters -- 7
+ code chart -- 4
+ code table -- 4
+ coded character -- 2
+ coded character set -- 2
+ combining character -- 4
+ compatibility character -- 4.1
+ composite sequence -- 4
+ control character -- 4.1
+ diacritic -- 4.1
+ displaying and rendering text -- 2
+ font -- 5
+ formatting character -- 4.1
+ glyph -- 2
+ glyph code -- 2
+ graphic symbol -- 5
+ i18n, l10n -- 2
+ ideographic -- 4.1
+ input methods -- 5
+ internationalization -- 2
+ ISO -- 3.1
+ language -- 2
+ language identification -- 6
+ Latin characters -- 7
+ local and regional standards organizations -- 3.1
+ locale -- 7
+ localization -- 2
+ MIME -- 6
+ multilingual -- 2
+ name spaces -- 6
+ nonspacing character -- 4.1
+ normalization -- 4
+ on-the-wire encoding -- 6
+ parsed text -- 6
+ private use -- 7
+ protocol elements -- 6
+ punctuation -- 4.1
+ quoted printable -- 6
+ regular expressions -- 7
+ rendering rules -- 5
+ romanization -- 7
+ script -- 2
+ SCSU -- 3.2
+ sorting and collation -- 4
+ symbol -- 4.1
+ transcoding -- 2
+ transcription -- 7
+ transfer encoding syntax -- 6
+ translation -- 7
+ transliteration -- 7
+ UCS-2 and UCS-4 -- 3.2
+ undisplayable character -- 5
+ Unicode Consortium -- 3.1
+ UTF-32 -- 3.2
+ UTF-16, UTF-16BE, and UTF-16LE -- 3.2
+ UTF-8 -- 3.2
+ World Wide Web Consortium -- 3.1
+ XML -- 6
+
+A. Acknowledgements
+
+ The definitions in this document come from many sources, including a
+ wide variety of IETF documents.
+
+ James Seng contributed to the initial outline of this document.
+ Harald Alvestrand and Martin Duerst made extensive useful comments on
+ early versions. Others who contributed to the development include:
+
+ Dan Kohn
+ Jacob Palme
+ Johan van Wingen
+ Peter Constable
+ Yuri Demchenko
+ Susan Harris
+ Zita Wenzel
+ John Klensin
+ Henning Schulzrinne
+ Leslie Daigle
+ Markus Scherer
+ Ken Whistler
+
+B. Author's Address
+
+ Paul Hoffman
+ Internet Mail Consortium and VPN Consortium
+ 127 Segre Place
+ Santa Cruz, CA 95060 USA
+
+ EMail: paul.hoffman@imc.org and paul.hoffman@vpnc.org
+
+Full Copyright Statement
+
+ Copyright (C) The Internet Society (2003). All Rights Reserved.
+
+ This document and translations of it may be copied and furnished to
+ others, and derivative works that comment on or otherwise explain it
+ or assist in its implementation may be prepared, copied, published
+ and distributed, in whole or in part, without restriction of any
+ kind, provided that the above copyright notice and this paragraph are
+ included on all such copies and derivative works. However, this
+ document itself may not be modified in any way, such as by removing
+ the copyright notice or references to the Internet Society or other
+ Internet organizations, except as needed for the purpose of
+ developing Internet standards in which case the procedures for
+ copyrights defined in the Internet Standards process must be
+ followed, or as required to translate it into languages other than
+ English.
+
+ The limited permissions granted above are perpetual and will not be
+ revoked by the Internet Society or its successors or assigns.
+
+ This document and the information contained herein is provided on an
+ "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
+ TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
+ BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
+ HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
+ MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+
+Acknowledgement
+
+ Funding for the RFC Editor function is currently provided by the
+ Internet Society.
+