doc: Add RFC documents

author: Thomas Voss <mail@thomasvoss.com> 2024-11-27 20:54:24 +0100
committer: Thomas Voss <mail@thomasvoss.com> 2024-11-27 20:54:24 +0100
commit: 4bfd864f10b68b71482b35c818559068ef8d5797 (patch)
tree: e3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc6365.txt
parent: ea76e11061bda059ae9f9ad130a9895cc85607db (diff)
1 files changed, 2635 insertions, 0 deletions
diff --git a/doc/rfc/rfc6365.txt b/doc/rfc/rfc6365.txt
new file mode 100644
index 0000000..e0cfa2d
--- /dev/null
+++ b/doc/rfc/rfc6365.txt
@@ -0,0 +1,2635 @@
+
+
+
+
+
+
+Internet Engineering Task Force (IETF)                        P. Hoffman
+Request for Comments: 6365                                VPN Consortium
+BCP: 166                                                      J. Klensin
+Obsoletes: 3536                                           September 2011
+Category: Best Current Practice
+ISSN: 2070-1721
+
+
+          Terminology Used in Internationalization in the IETF
+
+Abstract
+
+   This document provides a list of terms used in the IETF when
+   discussing internationalization.  The purpose is to help frame
+   discussions of internationalization in the various areas of the IETF
+   and to help introduce the main concepts to IETF participants.
+
+Status of This Memo
+
+   This memo documents an Internet Best Current Practice.
+
+   This document is a product of the Internet Engineering Task Force
+   (IETF).  It represents the consensus of the IETF community.  It has
+   received public review and has been approved for publication by the
+   Internet Engineering Steering Group (IESG).  Further information on
+   BCPs is available in Section 2 of RFC 5741.
+
+   Information about the current status of this document, any errata,
+   and how to provide feedback on it may be obtained at
+   http://www.rfc-editor.org/info/rfc6365.
+
+Copyright Notice
+
+   Copyright (c) 2011 IETF Trust and the persons identified as the
+   document authors.  All rights reserved.
+
+   This document is subject to BCP 78 and the IETF Trust's Legal
+   Provisions Relating to IETF Documents
+   (http://trustee.ietf.org/license-info) in effect on the date of
+   publication of this document.  Please review these documents
+   carefully, as they describe your rights and restrictions with respect
+   to this document.  Code Components extracted from this document must
+   include Simplified BSD License text as described in Section 4.e of
+   the Trust Legal Provisions and are provided without warranty as
+   described in the Simplified BSD License.
+
+
+
+
+
+
+Hoffman & Klensin         Best Current Practice                 [Page 1]
+
+RFC 6365            Internationalization Terminology      September 2011
+
+
+Table of Contents
+
+   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
+     1.1.  Purpose of this Document . . . . . . . . . . . . . . . . .  3
+     1.2.  Format of the Definitions in This Document . . . . . . . .  4
+     1.3.  Normative Terminology  . . . . . . . . . . . . . . . . . .  4
+   2.  Fundamental Terms  . . . . . . . . . . . . . . . . . . . . . .  5
+   3.  Standards Bodies and Standards . . . . . . . . . . . . . . . . 10
+     3.1.  Standards Bodies . . . . . . . . . . . . . . . . . . . . . 11
+     3.2.  Encodings and Transformation Formats of ISO/IEC 10646  . . 13
+     3.3.  Native CCSs and Charsets . . . . . . . . . . . . . . . . . 15
+   4.  Character Issues . . . . . . . . . . . . . . . . . . . . . . . 16
+     4.1.  Types of Characters  . . . . . . . . . . . . . . . . . . . 20
+     4.2.  Differentiation of Subsets . . . . . . . . . . . . . . . . 23
+   5.  User Interface for Text  . . . . . . . . . . . . . . . . . . . 24
+   6.  Text in Current IETF Protocols . . . . . . . . . . . . . . . . 27
+   7.  Terms Associated with Internationalized Domain Names . . . . . 31
+     7.1.  IDNA Terminology . . . . . . . . . . . . . . . . . . . . . 31
+     7.2.  Character Relationships and Variants . . . . . . . . . . . 32
+   8.  Other Common Terms in Internationalization . . . . . . . . . . 33
+   9.  Security Considerations  . . . . . . . . . . . . . . . . . . . 36
+   10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 37
+     10.1. Normative References . . . . . . . . . . . . . . . . . . . 37
+     10.2. Informative References . . . . . . . . . . . . . . . . . . 37
+   Appendix A.  Additional Interesting Reading  . . . . . . . . . . . 41
+   Appendix B.  Acknowledgements  . . . . . . . . . . . . . . . . . . 42
+   Appendix C.  Significant Changes from RFC 3536 . . . . . . . . . . 42
+   Index  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
+
+1.  Introduction
+
+   As the IETF Character Set Policy specification [RFC2277] summarizes:
+   "Internationalization is for humans.  This means that protocols are
+   not subject to internationalization; text strings are."  Many
+   protocols throughout the IETF use text strings that are entered by,
+   or are visible to, humans.  Subject only to the limitations of their
+   own knowledge and facilities, it should be possible for anyone to
+   enter or read these text strings, which means that Internet users
+   must be able to enter text using typical input methods and have it be
+   displayed in any human language.  Further, text containing any
+   character should be able to be passed between Internet applications
+   easily.  This is the challenge of internationalization.
+
+1.1.  Purpose of this Document
+
+   This document provides a glossary of terms used in the IETF when
+   discussing internationalization.  The purpose is to help frame
+   discussions of internationalization in the various areas of the IETF
+   and to help introduce the main concepts to IETF participants.
+
+   Internationalization is discussed in many working groups of the IETF.
+   However, few working groups have internationalization experts.  When
+   designing or updating protocols, the question often comes up "Should
+   we internationalize this?" (or, more likely, "Do we have to
+   internationalize this?").
+
+   This document gives an overview of internationalization terminology
+   as it applies to IETF standards work by lightly covering the many
+   aspects of internationalization and the vocabulary associated with
+   those topics.  Some of the overview is somewhat tutorial in nature.
+   It is not meant to be a complete description of internationalization.
+   The definitions here SHOULD be used by IETF standards.  IETF
+   standards that explicitly want to create different definitions for
+   the terms defined here can do so, but unless an alternate definition
+   is provided the definitions of the terms in this document apply.
+   IETF standards that have a requirement for different definitions are
+   encouraged, for clarity's sake, to find terms different from the ones
+   defined here.  Some of the definitions in this document come from
+   earlier IETF documents and books.
+
+   As in many fields, there is disagreement in the internationalization
+   community on definitions for many words.  The topic of language
+   brings up particularly passionate opinions for experts and non-
+   experts alike.  This document attempts to define terms in a way that
+   will be most useful to the IETF audience.
+
+   This document uses definitions from many documents that have been
+   developed inside and outside the IETF.  The primary documents used
+   are:
+
+   o  ISO/IEC 10646 [ISOIEC10646]
+
+   o  The Unicode Standard [UNICODE]
+
+   o  W3C Character Model [CHARMOD]
+
+   o  IETF RFCs, including the Character Set Policy specification
+      [RFC2277] and the domain name internationalization standard
+      [RFC5890]
+
+1.2.  Format of the Definitions in This Document
+
+   In the body of this document, the source for the definition is shown
+   in angle brackets, such as "<ISOIEC10646>".  Many definitions are
+   shown as "<RFC6365>", which means that the definitions were crafted
+   originally for this document.  The angle bracket notation for the
+   source of definitions is different from the square bracket notation
+   used for references to documents, such as in the paragraph above;
+   these references are given in the reference sections of this
+   document.
+
+   For some terms, there are commentary and examples after the
+   definitions.  In those cases, the part before the angle brackets is
+   the definition that comes from the original source, and the part
+   after the angle brackets is commentary that is not a definition (such
+   as an example or further exposition).
+
+   Examples in this document use the notation for code points and names
+   from the Unicode Standard [UNICODE] and ISO/IEC 10646 [ISOIEC10646].
+   For example, the letter "a" may be represented as either "U+0061" or
+   "LATIN SMALL LETTER A".  See RFC 5137 [RFC5137] for a description of
+   this notation.
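
As a minimal sketch of the notation described above, the following Python snippet produces both forms for a given character. It uses the standard `unicodedata` module; the helper name `describe` is made up for this illustration and is not taken from RFC 5137 or the Unicode Standard.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the 'U+XXXX NAME' form for a single character."""
    # ord() yields the code point; names come from the Unicode
    # Character Database bundled with the Python interpreter.
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe("a"))  # U+0061 LATIN SMALL LETTER A
```

Code points are conventionally written with at least four hexadecimal digits, which is why the format string pads to a width of four.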
+
+1.3.  Normative Terminology
+
+   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
+   document are to be interpreted as described in RFC 2119 [RFC2119].
+
+2.  Fundamental Terms
+
+   This section covers basic topics that are needed for almost anyone
+   who is involved with making IETF protocols more friendly to non-ASCII
+   text (see Section 4.2) and with other aspects of
+   internationalization.
+
+   language
+
+      A language is a way that humans communicate.  The use of language
+      occurs in many forms, the most common of which are speech,
+      writing, and signing. <RFC6365>
+
+      Some languages have a close relationship between the written and
+      spoken forms, while others have a looser relationship.  The so-
+      called LTRU (Language Tag Registry Update) standards [RFC5646]
+      [RFC4647] discuss languages in more detail and provide identifiers
+      for languages for use in Internet protocols.  Note that computer
+      languages are explicitly excluded from this definition.
+
+   script
+
+      A set of graphic characters used for the written form of one or
+      more languages. <ISOIEC10646>
+
+      Examples of scripts are Latin, Cyrillic, Greek, Arabic, and Han
+      (the characters, often called ideographs after a subset of them,
+      used in writing Chinese, Japanese, and Korean).  RFC 2277
+      discusses scripts in detail.
+
+      It is common for internationalization novices to mix up the terms
+      "language" and "script".  This can be a problem in protocols that
+      differentiate the two.  Almost all protocols that are designed (or
+      were re-designed) to handle non-ASCII text deal with scripts (the
+      written systems) or characters, while fewer actually deal with
+      languages.
+
+      A single name can mean either a language or a script; for example,
+      "Arabic" is both the name of a language and the name of a script.
+      In fact, many scripts borrow their names from the names of
+      languages.  Further, many scripts are used to write more than one
+      language; for example, the Russian and Bulgarian languages are
+      written in the Cyrillic script.  Some languages can be expressed
+      using different scripts or were used with different scripts at
+      different times; the Mongolian language can be written in either
+      the Mongolian or Cyrillic scripts; Malay is primarily written in
+      Latin script today, but the earlier, Arabic-script-based, Jawi
+      form is still in use; and a number of languages were converted
+      from other scripts to Cyrillic in the first half of the last
+      century, some of which have switched again more recently.
+      Further, some languages are normally expressed with more than one
+      script at the same time; for example, the Japanese language is
+      normally expressed in the Kanji (Han), Katakana, and Hiragana
+      scripts in a single string of text.
+
+   writing system
+
+      A set of rules for using one or more scripts to write a particular
+      language.  Examples include the American English writing system,
+      the British English writing system, the French writing system, and
+      the Japanese writing system. <UNICODE>
+
+   character
+
+      A member of a set of elements used for the organization, control,
+      or representation of data. <ISOIEC10646>
+
+      There are at least three common definitions of the word
+      "character":
+
+      *  a general description of a text entity
+
+      *  a unit of a writing system, often synonymous with "letter" or
+         similar terms, but generalized to include digits and symbols of
+         various sorts
+
+      *  the encoded entity itself
+
+
+      When people talk about characters, they usually intend one of the
+      first two definitions.  The term "character" is often abbreviated
+      as "char".
+
+      A particular character is identified by its name, not by its
+      shape.  A name may suggest a meaning, but the character may be
+      used for representing other meanings as well.  A name may suggest
+      a shape, but that does not imply that only that shape is commonly
+      used in print, nor that the particular shape is associated only
+      with that name.
+
+   coded character
+
+      A character together with its coded representation. <ISOIEC10646>
+
+   coded character set
+
+      A coded character set (CCS) is a set of unambiguous rules that
+      establishes a character set and the relationship between the
+      characters of the set and their coded representation.
+      <ISOIEC10646>
+
+   character encoding form
+
+      A character encoding form is a mapping from a coded character set
+      (CCS) to the actual code units used to represent the data.
+      <UNICODE>
+
+   repertoire
+
+      The collection of characters included in a character set.  Also
+      called a character repertoire. <UNICODE>
+
+   glyph
+
+      A glyph is an image of a character that can be displayed after
+      being imaged onto a display surface. <RFC6365>
+
+      The Unicode Standard has a different definition that refers to an
+      abstract form that may represent different images when the same
+      character is rendered under different circumstances.
+
+   glyph code
+
+      A glyph code is a numeric code that refers to a glyph.  Usually,
+      the glyphs contained in a font are referenced by their glyph code.
+      Glyph codes are local to a particular font; that is, a different
+      font containing the same glyphs may use different codes. <UNICODE>
+
+   transcoding
+
+      Transcoding is the process of converting text data from one
+      character encoding form to another.  Transcoders work only at the
+      level of character encoding and do not parse the text.  Note:
+      Transcoding may involve one-to-one, many-to-one, one-to-many, or
+      many-to-many mappings.  Because some legacy mappings are glyphic,
+      they may not only be many-to-many, but also unordered: thus XYZ
+      may map to yxz. <CHARMOD>
+
+      In this definition, "many-to-one" means a sequence of characters
+      mapped to a single character.  The "many" does not mean
+      alternative characters that map to the single character.
+
+   character encoding scheme
+
+      A character encoding scheme (CES) is a character encoding form
+      plus byte serialization.  There are many character encoding
+      schemes in Unicode, such as UTF-8 and UTF-16BE. <UNICODE>
+
+      Some CESs are associated with a single CCS; for example, UTF-8
+      [RFC3629] applies only to the identical CCSs of ISO/IEC 10646 and
+      Unicode.  Other CESs, such as ISO 2022, are associated with many
+      CCSs.
+
+   charset
+
+      A charset is a method of mapping a sequence of octets to a
+      sequence of abstract characters.  A charset is, in effect, a
+      combination of one or more CCSs with a CES.  Charset names are
+      registered by the IANA according to procedures documented in
+      [RFC2978]. <RFC6365>
+
+      Many protocol definitions use the term "character set" in their
+      descriptions.  The terms "charset", or "character encoding scheme"
+      and "coded character set", are strongly preferred over the term
+      "character set" because "character set" has other definitions in
+      other contexts, particularly outside the IETF.  IETF standards
+      that use "character set" without defining the term usually mean
+      "a specific combination of one CCS with a CES", particularly when
+      they are talking about the "US-ASCII character set".
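
To make the charset definition concrete, the sketch below decodes the same two octets under three different charsets. It is only an illustration: the variable names are invented here, and the charset labels are the ones Python's codec registry accepts, which happen to match IANA-registered names.

```python
# The same octet sequence, interpreted under different charsets.
octets = bytes([0xC3, 0xA9])

as_utf8 = octets.decode("utf-8")         # one character: U+00E9
as_latin1 = octets.decode("iso-8859-1")  # two characters: U+00C3 U+00A9

try:
    octets.decode("us-ascii")
    ascii_ok = True
except UnicodeDecodeError:
    # US-ASCII defines no characters for octets above 0x7F.
    ascii_ok = False

print([f"U+{ord(c):04X}" for c in as_utf8])    # ['U+00E9']
print([f"U+{ord(c):04X}" for c in as_latin1])  # ['U+00C3', 'U+00A9']
print(ascii_ok)                                # False
```

Because the octets alone do not say which mapping applies, a protocol that permits more than one charset has to label which one is in use.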
+
+   internationalization
+
+      In the IETF, "internationalization" means to add or improve the
+      handling of non-ASCII text in a protocol. <RFC6365>  A different
+      perspective, more appropriate to protocols that are designed for
+      global use from the beginning, is the definition used by W3C:
+
+         "Internationalization is the design and development of a
+         product, application or document content that enables easy
+         localization for target audiences that vary in culture, region,
+         or language."  [W3C-i18n-Def]
+
+      Many protocols that handle text only handle one charset
+      (US-ASCII), or leave the question of what CCS and encoding are
+      used up to local guesswork (which leads, of course, to
+      interoperability problems).  If multiple charsets are permitted,
+      they must be explicitly identified [RFC2277].  Adding non-ASCII
+      text to a protocol allows the protocol to handle more scripts,
+      hopefully all of the ones useful in the world.  In today's world,
+      that is normally best accomplished by allowing Unicode encoded in
+      UTF-8 only, thereby shifting conversion issues away from
+      individual choices.
+
+   localization
+
+      The process of adapting an internationalized application platform
+      or application to a specific cultural environment.  In
+      localization, the same semantics are preserved while the syntax
+      may be changed.  [FRAMEWORK]
+
+      Localization is the act of tailoring an application for a
+      different language or script or culture.  Some internationalized
+      applications can handle a wide variety of languages.  Typical
+      users only understand a small number of languages, so the program
+      must be tailored to interact with users in just the languages they
+      know.
+
+      The major work of localization is translating the user interface
+      and documentation.  Localization involves not only changing the
+      language interaction, but also other relevant changes such as
+      display of numbers, dates, currency, and so on.  The better
+      internationalized an application is, the easier it is to localize
+      it for a particular language and character encoding scheme.
+
+      Localization is rarely an IETF matter, and protocols that are
+      merely localized, even if they are serially localized for several
+      locations, are generally considered unsatisfactory for the global
+      Internet.
+
+      Do not confuse "localization" with "locale", which is described in
+      Section 8 of this document.
+
+   i18n, l10n
+
+      These are abbreviations for "internationalization" and
+      "localization". <RFC6365>
+
+      "18" is the number of characters between the "i" and the "n" in
+      "internationalization", and "10" is the number of characters
+      between the "l" and the "n" in "localization".
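
The counting rule can be sketched in a few lines of Python. The function name `numeronym` is hypothetical and chosen for this example; the rule it applies (first letter, count of the letters in between, last letter) is the one stated above.

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as first letter + interior length + last letter."""
    if len(word) < 3:
        # Nothing lies between the first and last letter.
        return word
    return f"{word[0]}{len(word) - 2}{word[-1]}"


print(numeronym("internationalization"))  # i18n
print(numeronym("localization"))          # l10n
```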
+
+   multilingual
+
+      The term "multilingual" has many widely varying definitions and
+      thus is not recommended for use in standards.  Some of the
+      definitions relate to the ability to handle international
+      characters; other definitions relate to the ability to handle
+      multiple charsets; and still others relate to the ability to
+      handle multiple languages. <RFC6365>
+
+   displaying and rendering text
+
+      To display text, a system puts characters on a visual display
+      device such as a screen or a printer.  To render text, a system
+      analyzes the character input to determine how to display the text.
+      The terms "display" and "render" are sometimes used
+      interchangeably.  Note, however, that text might be rendered as
+      audio and/or tactile output, such as in systems that have been
+      designed for people with visual disabilities. <RFC6365>
+
+      Combining characters modify the display of the character (or, in
+      some cases, characters) that precede them.  When rendering such
+      text, the display engine must either find the glyph in the font
+      that represents the base character and all of the combining
+      characters, or it must render the combination itself.  Such
+      rendering can be straightforward, but it is sometimes complicated
+      when the combining marks interact with each other, such as when
+      there are two combining marks that would appear above the same
+      character.  Formatting characters can also change the way that a
+      renderer would display text.  Rendering can also be difficult for
+      some scripts that have complex display rules for base characters,
+      such as Arabic and Indic scripts.
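
As a rough sketch of the first step a renderer has to take, the Python snippet below groups each base character with the combining characters that follow it. It is a deliberate simplification: it relies only on the canonical combining class reported by the standard `unicodedata` module, it is not the full grapheme cluster algorithm of the Unicode Standard, and the function name `clusters` is invented for this example.

```python
import unicodedata


def clusters(text: str) -> list:
    """Group each base character with the combining marks after it."""
    groups = []
    for ch in text:
        # A nonzero canonical combining class marks a combining character.
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


# "e" + COMBINING ACUTE ACCENT, then "a" with two marks above it, then "x".
sample = "e\u0301a\u0308\u0304x"
for group in clusters(sample):
    print(" ".join(f"U+{ord(c):04X}" for c in group))
```

The second group is the harder case mentioned above: two marks that would both appear above the same base character.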
+
+3.  Standards Bodies and Standards
+
+   This section describes some of the standards bodies and standards
+   that appear in discussions of internationalization in the IETF.  This
+   is an incomplete and possibly over-full list; listing too few bodies
+   or standards can be just as politically dangerous as listing too
+   many.  Note that there are many other bodies that deal with
+   internationalization; however, few if any of them appear commonly in
+   IETF standards work.
+
+3.1.  Standards Bodies
+
+   ISO and ISO/IEC JTC 1
+
+      The International Organization for Standardization has been
+      involved with standards for characters since before the IETF was
+      started.  ISO is a non-governmental group made up of national
+      bodies.  Most of ISO's work in information technology is performed
+      jointly with a similar body, the International Electrotechnical
+      Commission (IEC) through a joint committee known as "JTC 1".  ISO
+      and ISO/IEC JTC 1 have many diverse standards in the international
+      characters area; the one that is most used in the IETF is commonly
+      referred to as "ISO/IEC 10646", sometimes with a specific date.
+      ISO/IEC 10646 describes a CCS that covers almost all known written
+      characters in use today.
+
+      ISO/IEC 10646 is controlled by the group known as "ISO/IEC JTC 1/
+      SC 2 WG2", often called "SC2/WG2" or "WG2" for short.  ISO
+      standards go through many steps before being finished, and years
+      often go by between changes to the base ISO/IEC 10646 standard,
+      although amendments are now issued to track Unicode changes.
+      Information on WG2, and its work products, can be found at
+      <http://www.dkuug.dk/JTC1/SC2/WG2/>.  Information on SC2, and its
+      work products, can be found at <http://www.iso.org/iso/
+      standards_development/technical_committees/
+      list_of_iso_technical_committees/
+      iso_technical_committee.htm?commid=45050>
+
+      The standard comes as a base part and a series of attachments or
+      amendments.  It is available in PDF form for downloading or in a
+      CD-ROM version.  One example of how to cite the standard is given
+      in [RFC3629].  Any standard that cites ISO/IEC 10646 needs to
+      evaluate how to handle the versioning problem that is relevant to
+      the protocol's needs.
+
+      ISO is responsible for other standards that might be of interest
+      to protocol developers concerned about internationalization.
+      ISO 639 [ISO639] specifies the names of languages and forms part
+      of the basis for the IETF's Language Tag work [RFC5646].  ISO 3166
+      [ISO3166] specifies the names and code abbreviations for countries
+      and territories and is used in several protocols and databases
+      including names for country-code top level domain names.  The
+      responsibilities of ISO TC 46 on Information and Documentation
+      <http://www.iso.org/iso/standards_development/
+      technical_committees/list_of_iso_technical_committees/
+      iso_technical_committee.htm?commid=48750> include a series of
+      standards for transliteration of various languages into Latin
+      characters.
+
+      Another relevant ISO group was JTC 1/SC22/WG20, which was
+      responsible for internationalization in JTC 1, such as for
+      international string ordering.  Information on WG20, and its work
+      products, can be found at <http://www.dkuug.dk/jtc1/sc22/wg20/>.
+      The specific tasks of SC22/WG20 were moved from SC22 into SC2, and
+      there has been little significant activity since that occurred.
+
+   Unicode Consortium
+
+      The second important group for international character standards
+      is the Unicode Consortium.  The Unicode Consortium is a trade
+      association of companies, governments, and other groups interested
+      in promoting the Unicode Standard [UNICODE].  The Unicode Standard
+      is a CCS whose repertoire and code points are identical to
+      ISO/IEC 10646.  The Unicode Consortium has added features to the
+      base CCS that make it more useful in protocols, such as defining
+      attributes for each character.  Examples of these attributes
+      include case conversion and numeric properties.
+
+      The actual technical and definitional work of the Unicode
+      Consortium is done in the Unicode Technical Committee (UTC).  The
+      terms "UTC" and "Unicode Consortium" are often treated,
+      imprecisely, as synonymous in the IETF.
+
+      The Unicode Consortium publishes addenda to the Unicode Standard
+      as Unicode Technical Reports.  There are many types of technical
+      reports at various stages of maturity.  The Unicode Standard and
+      affiliated technical reports can be found at
+      <http://www.unicode.org/>.
+
+      A reciprocal agreement between the Unicode Consortium and
+      ISO/IEC JTC 1/SC 2 provides for ISO/IEC 10646 and The Unicode
+      Standard to track each other for definitions of characters and
+      assignments of code points.  Updates, often in the form of
+      amendments, to the former sometimes lag updates to the latter for
+      a short period, but the gap has rarely been significant in recent
+      years.
+
+      At the time that the IETF character set policy [RFC2277] was
+      established and the first version of this terminology
+      specification was published, there was a strong preference in the
+      IETF community for references to ISO/IEC 10646 (rather than
+      Unicode) when possible.  That preference largely reflected a more
+      general IETF preference for referencing established open
+      international standards over specifications from consortia.
+      However, the Unicode definitions of character properties and
+      classes are not part of ISO/IEC 10646.  Because IETF
+      specifications are increasingly dependent on those definitions
+      (for example, see the explanation in Section 4.2) and the Unicode
+      specifications are freely available online in convenient machine-
+      readable form, the IETF's preference has shifted to referencing
+      the Unicode Standard.  The latter is especially important when
+      version consistency between code points (either standard) and
+      Unicode properties (Unicode only) is required.
+
+   World Wide Web Consortium (W3C)
+
+      This group created and maintains the standard for XML, the markup
+      language for text that has become very popular.  XML has always
+      been fully internationalized so that there is no need for a new
+      version to handle international text.  However, in some
+      circumstances, XML files may be sensitive to differences among
+      Unicode versions.
+
+   local and regional standards organizations
+
+      Just as there are many native CCSs and charsets, there are many
+      local and regional standards organizations to create and support
+      them.  Common examples of these are ANSI (United States), CEN/ISSS
+      (Europe), JIS (Japan), and SAC (China).
+
+3.2.  Encodings and Transformation Formats of ISO/IEC 10646
+
+   Characters in the ISO/IEC 10646 CCS can be expressed in many ways.
+   Historically, "encoding forms" were direct addressing methods, while
+   "transformation formats" were methods for expressing encoding forms
+   as bits on the wire.  That distinction has mostly disappeared in
+   recent years.
+
+   Documents that discuss characters in the ISO/IEC 10646 CCS often need
+   to list specific characters.  RFC 5137 describes the common methods
+   for doing so in IETF documents, and these practices have been adopted
+   by many other communities as well.
+
+   Basic Multilingual Plane (BMP)
+
+      The BMP is composed of the first 2^16 code points in ISO/IEC 10646
+      and contains almost all characters in contemporary use.  The BMP
+      is also called "Plane 0".
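
A short sketch of the arithmetic, assuming code points are handled as plain integers. The helper names `plane` and `in_bmp` are hypothetical. Each plane holds 2^16 code points, so the plane number is the code point shifted right by 16 bits.

```python
def plane(code_point: int) -> int:
    """Return the plane number (0 for the BMP) of a code point."""
    return code_point >> 16


def in_bmp(code_point: int) -> bool:
    """True if the code point is among the first 2**16 code points."""
    return code_point < 2**16


for cp in (0x0061, 0xFFFF, 0x10000, 0x1F600):
    print(f"U+{cp:04X}: plane {plane(cp)}, in BMP: {in_bmp(cp)}")
```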
+
+   UCS-2 and UCS-4
+
+      UCS-2 and UCS-4 are the two encoding forms historically defined
+      for ISO/IEC 10646.  UCS-2 addresses only the BMP.  Because many
+      useful characters (such as many Han characters) have been defined
+      outside of the BMP, many people consider UCS-2 to be obsolete.
+
+      UCS-4 addresses the entire range of code points from ISO/IEC 10646
+      (by agreement between ISO/IEC JTC 1 SC2 and the Unicode
+      Consortium, a range from 0..0x10FFFF) as 32-bit values with zero
+      padding to the left.  UCS-4 is identical to UTF-32BE (without use
+      of a BOM (see below)); UTF-32BE is now the preferred term.
+
+   UTF-8
+
+      UTF-8 [RFC3629] is the preferred encoding for IETF protocols.
+      Characters in the BMP are encoded as one, two, or three octets.
+      Characters outside the BMP are encoded as four octets.  Characters
+      from the US-ASCII repertoire have the same on-the-wire
+      representation in UTF-8 as they do in US-ASCII.  The IETF-specific
+      definition of UTF-8 in RFC 3629 is identical to that in recent
+      versions of the Unicode Standard (e.g., in Section 3.9 of Version
+      6.0 [UNICODE]).
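+
+      The octet counts described above can be checked with a minimal
+      Python sketch.  The helper name utf8_octet_count and the sample
+      characters are chosen here purely for illustration.
+
```python
def utf8_octet_count(ch: str) -> int:
    """Number of octets UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


# Three BMP characters and one character outside the BMP.
for ch in ("A", "\u00e9", "\u20ac", "\U0001f600"):
    in_bmp = ord(ch) <= 0xFFFF
    print(f"U+{ord(ch):04X} BMP={in_bmp} octets={utf8_octet_count(ch)}")

# US-ASCII text has the same on-the-wire octets in both encodings.
print("IETF".encode("utf-8") == "IETF".encode("ascii"))
```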
+
+   UTF-16, UTF-16BE, and UTF-16LE
+
+      UTF-16, UTF-16BE, and UTF-16LE, three transformation formats
+      described in [RFC2781] and defined in The Unicode Standard
+      (Sections 3.9 and 16.8 of Version 6.0), are not required by any
+      IETF standards, and are thus used much less often in protocols
+      than UTF-8.  Characters in the BMP are always encoded as two
+      octets, and characters outside the BMP are encoded as four octets
+      using a "surrogate pair" arrangement.  The latter is not part of
+      UCS-2, marking the difference between UTF-16 and UCS-2.  The three
+      UTF-16 formats differ based on the order of the octets and the
+      presence or absence of a special lead-in ordering identifier
+      called the "byte order mark" or "BOM".
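+
+      The surrogate pair arrangement and the octet-order variants can
+      be sketched in Python as follows.  The function surrogate_pair is
+      a hypothetical helper that applies the standard UTF-16
+      arithmetic; the codec names are those of the Python standard
+      library.
+
```python
def surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


clef = "\U0001d11e"  # MUSICAL SYMBOL G CLEF, outside the BMP
high, low = surrogate_pair(ord(clef))
print(f"{high:04X} {low:04X}")          # D834 DD1E
print(clef.encode("utf-16-be").hex())   # d834dd1e
print(clef.encode("utf-16-le").hex())   # 34d81edd
# The plain "utf-16" codec prepends a BOM in the platform's byte order.
print("A".encode("utf-16")[:2].hex())
```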
+
+   UTF-32
+
+      The Unicode Consortium and ISO/IEC JTC 1 have defined UTF-32 as a
+      transformation format that incorporates the integer code point
+      value right-justified in a 32-bit field.  As with UTF-16, the byte
+      order mark (BOM) can be used and UTF-32BE and UTF-32LE are
+      defined.  UTF-32 and UCS-4 are essentially equivalent and the
+      terms are often used interchangeably.
+
+   SCSU and BOCU-1
+
+      The Unicode Consortium has defined an encoding, SCSU [UTR6], which
+      is designed to offer good compression for typical text.  A
+      different encoding that is meant to be MIME-friendly, BOCU-1, is
+      described in [UTN6].  Although their compression is attractive
+      compared to UTF-8, neither of these (at the time of this writing)
+      has attracted much interest.
+
+      The compression provided as a side effect of the Punycode
+      algorithm [RFC3492] is heavily used in some contexts, especially
+      IDNA [RFC5890], but imposes some restrictions.  (See also
+      Section 7.)
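+
+      As a rough illustration, Python ships a "punycode" codec and an
+      "idna" codec.  Note the assumption: the standard-library "idna"
+      codec follows the older IDNA2003 rules rather than [RFC5890], so
+      this sketch only shows the general shape of the encoding.
+
```python
label = "b\u00fccher"  # "bücher"

# Raw Punycode: the ASCII characters, a delimiter, then the encoded rest.
print(label.encode("punycode"))  # b'bcher-kva'

# The IDNA form adds the "xn--" prefix to the Punycode output.
ace = label.encode("idna")
print(ace)                       # b'xn--bcher-kva'
print(ace.decode("idna") == label)
```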
+
+3.3.  Native CCSs and Charsets
+
+   Before ISO/IEC 10646 was developed, many countries developed their
+   own CCSs and charsets.  Some of these were adopted into international
+   standards for the relevant scripts or writing systems.  Many dozens
+   of these are in common use on the Internet today.  Examples include
+   ISO 8859-5 for Cyrillic and Shift-JIS for Japanese scripts.
+
+   The official list of the registered charset names for use with IETF
+   protocols is maintained by IANA and can be found at
+   <http://www.iana.org/assignments/character-sets>.  The list contains
+   preferred names and aliases.  Note that this list has historically
+   contained many errors, such as names that are in fact not charsets or
+   references that do not give enough detail to reliably map names to
+   charsets.
+
+   Probably the most well-known native CCS is ASCII [US-ASCII].  This
+   CCS is used as the basis for keywords and parameter names in many
+   IETF protocols, and as the sole CCS in numerous IETF protocols that
+   have not yet been internationalized.  ASCII became the basis for
+   ISO/IEC 646 which, in turn, formed the basis for many national and
+   international standards, such as the ISO 8859 series, that mix Basic
+   Latin characters with characters from another script.
+
+   It is important to note that, strictly speaking, "ASCII" is a CCS and
+   repertoire, not an encoding.  The encoding used for ASCII in IETF
+   protocols involves the 7-bit integer ASCII code point right-justified
+   in an 8-bit field and is sometimes described as the "Network Virtual
+   Terminal" or "NVT" encoding [RFC5198].  Less formally, "ASCII" and
+   "NVT" are often used interchangeably.  However, "non-ASCII" refers
+   only to characters outside the ASCII repertoire and is not linked to
+   a specific encoding.  See Section 4.2.
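+
+   A minimal Python sketch of the "7-bit code point in an 8-bit field"
+   idea follows.  The helper name fits_ascii_encoding is hypothetical;
+   it checks only that the high-order bit of every octet is zero.
+
```python
def fits_ascii_encoding(data: bytes) -> bool:
    """True if every octet carries a 7-bit value (high-order bit zero)."""
    return all(octet <= 0x7F for octet in data)


print(fits_ascii_encoding("MAIL FROM".encode("ascii")))  # True
print(fits_ascii_encoding("caf\u00e9".encode("utf-8")))  # False
```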
+
+   A Unicode publication describes issues involved in mapping character
+   data between charsets, and an XML format for mapping table data
+   [UTR22].
+
+4.  Character Issues
+
+   This section contains terms and topics that are commonly used in
+   character handling and therefore are of concern to people adding non-
+   ASCII text handling to protocols.  These topics are standardized
+   outside the IETF.
+
+   code point
+
+      A value in the codespace of a repertoire.  For all common
+      repertoires developed in recent years, code point values are
+      integers (code points for ASCII and its immediate descendants were
+      defined in terms of column and row positions of a table).
+
+   combining character
+
+      A member of an identified subset of the coded character set of
+      ISO/IEC 10646 intended for combination with the preceding non-
+      combining graphic character, or with a sequence of combining
+      characters preceded by a non-combining character.  Combining
+      characters are inherently non-spacing. <ISOIEC10646>
+
+   composite sequence or combining character sequence
+
+      A sequence of graphic characters consisting of a non-combining
+      character followed by one or more combining characters.  A graphic
+      symbol for a composite sequence generally consists of the
+      combination of the graphic symbols of each character in the
+      sequence.  The Unicode Standard often uses the term "combining
+      character sequence" to refer to composite sequences.  A composite
+      sequence is not a character and therefore is not a member of the
+      repertoire of ISO/IEC 10646. <ISOIEC10646>  However, Unicode now
+      assigns names to some such sequences especially when the names are
+      required to match terminology in other standards [UAX34].
+
+      In some CCSs, some characters consist of combinations of other
+      characters.  For example, the letter "a with acute" might be a
+      combination of the two characters "a" and "combining acute", or it
+      might be a combination of the three characters "a", a non-
+      destructive backspace, and an acute.  In the same or other CCSs,
+      it might be available as a single code point.  The rules for
+      combining two or more characters are called "composition rules",
+      and the rules for taking apart a character into other characters
+      are called "decomposition rules".  The result of decomposition is
+      called a "decomposed character"; the result of composition is
+      usually a "precomposed character".
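+
+      For Unicode, decomposition and composition can be observed with
+      the Python standard library's unicodedata module.  This is an
+      illustrative sketch only; other CCSs have their own rules.
+
```python
import unicodedata

precomposed = "\u00e1"  # LATIN SMALL LETTER A WITH ACUTE

# Decomposition: one code point becomes "a" plus a combining acute.
decomposed = unicodedata.normalize("NFD", precomposed)
print([f"U+{ord(ch):04X}" for ch in decomposed])  # ['U+0061', 'U+0301']

# Composition: the two-character sequence becomes one code point again.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == precomposed)                  # True
```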
+
+   normalization
+
+      Normalization is the transformation of data to a normal form, for
+      example, to unify spelling. <UNICODE>
+
+      Note that the phrase "unify spelling" in the definition above does
+      not mean unifying different strings with the same meaning as words
+      (such as "color" and "colour").  Instead, it means unifying
+      different character sequences that are intended to form the same
+      composite characters, such as "<n><combining tilde>" and "<n with
+      tilde>" (where "<n>" is U+006E, "<combining tilde>" is U+0303, and
+      "<n with tilde>" is U+00F1).
+
+      The purpose of normalization is to allow two strings to be
+      compared for equivalence.  The strings "<a><n><combining
+      tilde><o>" and "<a><n with tilde><o>" would be shown identically
+      on a text display device.  If a protocol designer wants those two
+      strings to be considered equivalent during comparison, the
+      protocol must define where normalization occurs.
+
+      The terms "normalization" and "canonicalization" are often used
+      interchangeably.  Generally, they both mean to convert a string of
+      one or more characters into another string based on standardized
+      rules.  However, in Unicode, "canonicalization" or similar terms
+      are used to refer to a particular type of normalization
+      equivalence ("canonical equivalence" in contrast to "compatibility
+      equivalence"), so the term should be used with some care.  Some
+      CCSs allow multiple equivalent representations for a written
+      string; normalization selects one among multiple equivalent
+      representations as a base for reference purposes in comparing
+      strings.  In strings of text, these rules are usually based on
+      decomposing combined characters or composing characters with
+      combining characters.  Unicode Standard Annex #15 [UTR15]
+      describes the process and many forms of normalization in detail.
+      Normalization is important when comparing strings to see if they
+      are the same.
+
+      The Unicode NFC and NFD normalizations support canonical
+      equivalence; NFKC and NFKD support canonical and compatibility
+      equivalence.
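+
+      A comparison that normalizes both operands first can be sketched
+      in Python as follows.  The function name equivalent and its
+      default form are assumptions made for this example; a protocol
+      would have to specify which form applies and where.
+
```python
import unicodedata


def equivalent(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


decomposed = "an\u0303o"   # <a><n><combining tilde><o>
precomposed = "a\u00f1o"   # <a><n with tilde><o>

print(decomposed == precomposed)            # False: code points differ
print(equivalent(decomposed, precomposed))  # True: canonical equivalence

# U+FB01 (LATIN SMALL LIGATURE FI) is only a compatibility equivalent.
print(equivalent("\ufb01", "fi"))           # False under NFC
print(equivalent("\ufb01", "fi", "NFKC"))   # True under NFKC
```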
+
+   case
+
+      Case is the feature of certain alphabets where the letters have
+      two (or occasionally more) distinct forms.  These forms, which may
+      differ markedly in shape and size, are called the uppercase letter
+      (also known as capital or majuscule) and the lowercase letter
+      (also known as small or minuscule).  Case mapping is the
+      association of the uppercase and lowercase forms of a letter.
+      <UNICODE>
+
+      There is usually (but not always) a one-to-one mapping between the
+      same letter in the two cases.  However, there are many examples of
+      characters that exist in one case but for which there is no
+      corresponding character in the other case or for which there is a
+      special mapping rule, such as the Turkish dotless "i", some Greek
+      characters with modifiers, and characters like the German Sharp S
+      (Eszett) and Greek Final Sigma that traditionally do not have
+      uppercase forms.  Case mapping can even be dependent on locale or
+      language.  Converting text to have only a single case, primarily
+      for comparison purposes, is called "case folding".  Because of the
+      various unusual cases, case mapping can be quite controversial and
+      some case folding algorithms even more so.  For example, some
+      programming languages such as Java have case-folding algorithms
+      that are locale-sensitive; this makes those algorithms incredibly
+      resource-intensive and makes them act differently depending on the
+      location of the system at the time the algorithm is used.
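+
+      Some of the irregular mappings mentioned above can be observed in
+      Python, whose built-in string methods use the default Unicode
+      mappings and are not locale-sensitive.  The helper name
+      caseless_equal is hypothetical.
+
```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after Unicode default case folding."""
    return a.casefold() == b.casefold()


print("\u00df".upper())                     # SS: one letter becomes two
print("\u00df".casefold())                  # ss
print("\u03c2".casefold() == "\u03c3")      # final sigma folds to sigma
print("\u0131".upper())                     # I: dotless i uppercases to I
print(caseless_equal("Stra\u00dfe", "STRASSE"))  # True
```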
+
+   sorting and collation
+
+      Collating is the process of ordering units of textual information.
+      Collation is usually specific to a particular language or even to
+      a particular application or locale.  It is sometimes known as
+      alphabetizing, although alphabetization is just a special case of
+      sorting and collation. <UNICODE>
+
+      Collation is concerned with the determination of the relative
+      order of any particular pair of strings, and algorithms concerned
+      with collation focus on the problem of providing appropriate
+      weighted keys for string values, to enable binary comparison of
+      the key values to determine the relative ordering of the strings.
+
+      The relative orders of letters in collation sequences can differ
+      widely based on the needs of the system or protocol defining the
+      collation order.  For example, even within ASCII characters, there
+      are two common and very different collation orders: "A, a, B,
+      b,..." and "A, B, C, ..., Z, a, b,...", with additional variations
+      for lowercase first and digits before and after letters.
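+
+      The two ASCII orders can be reproduced with a short Python
+      sketch.  The key function interleaved_key is a hypothetical
+      stand-in for a real collation key.
+
```python
words = ["zebra", "Apple", "apple", "Zebra"]


def interleaved_key(word: str) -> tuple:
    """Order by letter first, then by case ("A, a, B, b, ...")."""
    return (word.lower(), word)


# Code point order: "A, B, C, ..., Z, a, b, ...".
print(sorted(words))
# ['Apple', 'Zebra', 'apple', 'zebra']

# Interleaved order: "A, a, B, b, ...".
print(sorted(words, key=interleaved_key))
# ['Apple', 'apple', 'Zebra', 'zebra']
```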
+
+      In practice, it is rarely necessary to define a collation sequence
+      for characters drawn from different scripts, but arranging such
+      sequences so as to not surprise users is usually particularly
+      problematic.
+
+      Sorting is the process of actually putting data records into
+      specified orders, according to criteria for comparison between the
+      records.  Sorting can apply to any kind of data (including textual
+      data) for which an ordering criterion can be defined.  Algorithms
+      concerned with sorting focus on the problem of performance (in
+      terms of time, memory, or other resources) in actually putting the
+      data records into the desired order.
+
+      A sorting algorithm for string data can be internationalized by
+      providing it with the appropriate collation-weighted keys
+      corresponding to the strings to be ordered.
+
+      Many processes have a need to order strings in a consistent
+      (sorted) sequence.  For only a few CCS/CES combinations, there is
+      an obvious sort order that can be applied without reference to the
+      linguistic meaning of the characters: the code point order is
+      sufficient for sorting.  That is, the code point order is also the
+      order that a person would use in sorting the characters.  For many
+      CCS/CES combinations, the code point order would make no sense to
+      a person and therefore is not useful for sorting if the results
+      will be displayed to a person.
+
+      Code point order is usually not how any human educated by a local
+      school system expects to see strings ordered; if one orders to the
+      expectations of a human, one has a "language-specific" or "human
+      language" sort.  Sorting to code point order will seem
+      inconsistent if the strings are not normalized before sorting
+      because different representations of the same character will sort
+      differently.  This problem may be smaller with a language-specific
+      sort.
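+
+      The inconsistency of sorting unnormalized strings in code point
+      order can be shown with a small Python sketch.  The sample words
+      are arbitrary; the same word appears once precomposed and once
+      decomposed.
+
```python
import unicodedata

raw = ["\u00e9t\u00e9", "fin", "e\u0301te\u0301"]  # "été", "fin", "été"

# Without normalization, the two spellings land on either side of "fin".
print(sorted(raw) == ["e\u0301te\u0301", "fin", "\u00e9t\u00e9"])

# After NFC normalization, identical words sort next to each other.
normalized = sorted(unicodedata.normalize("NFC", w) for w in raw)
print(normalized == ["fin", "\u00e9t\u00e9", "\u00e9t\u00e9"])
```

      Even after normalization, "fin" still sorts before the accented
      word, which illustrates why code point order is rarely the order
      a person expects.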
+
+   code table
+
+      A code table is a table showing the characters allocated to the
+      octets in a code. <ISOIEC10646>
+
+      Code tables are also commonly called "code charts".
+
+4.1.  Types of Characters
+
+   The following definitions of types of characters do not clearly
+   delineate each character into one type, nor do they allow someone to
+   accurately predict what types would apply to a particular character.
+   The definitions are intended for application designers to help them
+   think about the many (sometimes confusing) properties of text.
+
+   alphabetic
+
+      An informative Unicode property.  Characters that are the primary
+      units of alphabets and/or syllabaries, whether combining or non-
+      combining.  This includes composite characters that are canonical
+      equivalents to a combining character sequence of an alphabetic
+      base character plus one or more combining characters: letter
+      digraphs; contextual variants of alphabetic characters; ligatures
+      of alphabetic characters; contextual variants of ligatures;
+      modifier letters; letterlike symbols that are compatibility
+      equivalents of single alphabetic letters; and miscellaneous letter
+      elements. <UNICODE>
+
+   ideographic
+
+      Any symbol that primarily denotes an idea (or meaning) in contrast
+      to a sound (or pronunciation), for example, a symbol showing a
+      telephone or the Han characters used in Chinese, Japanese, and
+      Korean. <UNICODE>
+
+      While Unicode and many other systems use this term to refer to all
+      Han characters, strictly speaking not all of those characters are
+      actually ideographic.  Some are pictographic (such as the
+      telephone example above), some are used phonetically, and so on.
+      However, the convention is to describe the script as ideographic
+      as contrasted to alphabetic.
+
+   digit or number
+
+      All modern writing systems use decimal digits in some form; some
+      older ones use non-positional or other systems.  Different scripts
+      may have their own digits.  Unicode distinguishes between numbers
+      and other kinds of characters by assigning a special General
+      Category value to them and subdividing that value to distinguish
+      between decimal digits, letter digits, and other digits. <UNICODE>
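+
+      The subdivision of the numeric General Category can be inspected
+      with Python's unicodedata module.  The sample characters below
+      were picked for illustration only.
+
```python
import unicodedata

samples = {
    "7": "DIGIT SEVEN",
    "\u0663": "ARABIC-INDIC DIGIT THREE",
    "\u2167": "ROMAN NUMERAL EIGHT",
    "\u00bd": "VULGAR FRACTION ONE HALF",
}

for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),        # Nd, Nl, or No
        unicodedata.decimal(ch, None),   # None unless a decimal digit
        unicodedata.numeric(ch, None),   # numeric value, if any
    )
```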
+
+   punctuation
+
+      Characters that separate units of text, such as sentences and
+      phrases, thus clarifying the meaning of the text.  The use of
+      punctuation marks is not limited to prose; they are also used in
+      mathematical and scientific formulae, for example. <UNICODE>
+
+   symbol
+
+      One of a set of characters other than those used for letters,
+      digits, or punctuation, and representing various concepts
+      generally not connected to written language use per se. <RFC6365>
+
+      Examples of symbols include characters for mathematical operators,
+      symbols for optical character recognition (OCR), symbols for box-
+      drawing or graphics, as well as symbols for dingbats, arrows,
+      faces, and geometric shapes.  Unicode has a property that
+      identifies symbol characters.
+
+   nonspacing character
+
+      A combining character whose positioning in presentation is
+      dependent on its base character.  It generally does not consume
+      space along the visual baseline in and of itself. <UNICODE>
+
+      A combining acute accent (U+0301) is an example of a nonspacing
+      character.
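+
+      In Python, the General Category "Mn" (Mark, nonspacing) is one
+      way to test for such characters.  Treating "Mn" as equivalent to
+      the definition above is an assumption made for this sketch.
+
```python
import unicodedata


def is_nonspacing_mark(ch: str) -> bool:
    """True if the character's General Category is Mn."""
    return unicodedata.category(ch) == "Mn"


acute = "\u0301"  # COMBINING ACUTE ACCENT
print(is_nonspacing_mark(acute))      # True
print(is_nonspacing_mark("e"))        # False
print(unicodedata.combining(acute))   # 230, its canonical combining class
print(len("e" + acute))               # 2 code points, one displayed unit
```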
+
+   diacritic
+
+      A mark applied or attached to a symbol to create a new symbol that
+      represents a modified or new value.  They can also be marks
+      applied to a symbol irrespective of whether they change the value
+      of that symbol.  In the latter case, the diacritic usually
+      represents an independent value (for example, an accent, tone, or
+      some other linguistic information).  Also called diacritical mark
+      or diacritical. <UNICODE>
+
+   control character
+
+      The 65 characters in the ranges U+0000..U+001F and U+007F..U+009F.
+      The basic space character, U+0020, is often considered as a
+      control character as well, making the total number 66.  They are
+      also known as control codes.  In terminology adopted by Unicode
+      from ASCII and the ISO 8859 standards, these codes are treated as
+      belonging to three ranges: "C0" (for U+0000..U+001F), "C1" (for
+      U+0080..U+009F), and the single control character "DEL" (U+007F).
+      <UNICODE>
+
+      Occasionally, in other vocabularies, the term "control character"
+      is used to describe any character that does not normally have an
+      associated glyph; it is also sometimes used for device control
+      sequences [ISO6429].  Neither of those usages is appropriate to
+      internationalization terminology in the IETF.
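+
+      The count of 65 can be verified with a short Python sketch.  The
+      helper name is_control is hypothetical and simply encodes the
+      two ranges given in the definition.
+
```python
import unicodedata


def is_control(ch: str) -> bool:
    """True for U+0000..U+001F and U+007F..U+009F."""
    cp = ord(ch)
    return cp <= 0x1F or 0x7F <= cp <= 0x9F


controls = [chr(cp) for cp in range(0x100) if is_control(chr(cp))]
print(len(controls))                                           # 65
print(all(unicodedata.category(c) == "Cc" for c in controls))  # True
# U+0020 is a space separator in Unicode, not a control character.
print(unicodedata.category(" "))                               # Zs
```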
+
+   formatting character
+
+      Characters that are inherently invisible but that have an effect
+      on the surrounding characters. <UNICODE>
+
+      Examples of formatting characters include characters for
+      specifying the direction of text and characters that specify how
+      to join multiple characters.
+
+   compatibility character or compatibility variant
+
+      A graphic character included as a coded character of ISO/IEC 10646
+      primarily for compatibility with existing coded character sets.
+      <ISOIEC10646>
+
+      The Unicode definition of compatibility character also includes
+      characters that have been incorporated for other reasons.  Their
+      list includes several separate groups of characters included for
+      compatibility purposes: halfwidth and fullwidth characters used
+      with East Asian scripts, Arabic contextual forms (e.g., initial or
+      final forms), some ligatures, deprecated formatting characters,
+      variant forms of characters (or even copies of them) for
+      particular uses (e.g., phonetic or mathematical applications),
+      font variations, CJK compatibility ideographs, and so on.  For
+      additional information and the separate term "compatibility
+      decomposable character", see the Unicode standard.
+
+      For example, U+FF01 (FULLWIDTH EXCLAMATION MARK) was included for
+      compatibility with Asian charsets that include full-width and
+      half-width ASCII characters.
+
+      Some efforts in the IETF have concluded that it would be useful to
+      support mapping of some groups of compatibility equivalents and
+      not others (e.g., supporting or mapping width variations while
+      preserving or rejecting mathematical variations).  See the IDNA
+      Mapping document [RFC5895] for one example.
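+
+      Selective mapping of this kind can be sketched in Python using
+      the decomposition tags exposed by the unicodedata module.  This
+      is an illustration under stated assumptions, not the procedure
+      of [RFC5895]: the function map_width_variants and the choice of
+      tags are hypothetical.
+
```python
import unicodedata

WIDTH_TAGS = ("<wide>", "<narrow>")


def map_width_variants(text: str) -> str:
    """Map fullwidth/halfwidth variants; leave other characters alone."""
    out = []
    for ch in text:
        parts = unicodedata.decomposition(ch).split()
        if parts and parts[0] in WIDTH_TAGS:
            # The remaining fields are hexadecimal code points.
            out.append("".join(chr(int(cp, 16)) for cp in parts[1:]))
        else:
            out.append(ch)
    return "".join(out)


print(unicodedata.decomposition("\uff01"))        # <wide> 0021
print(map_width_variants("\uff21\uff01"))         # A!
bold_a = "\U0001d400"  # MATHEMATICAL BOLD CAPITAL A, tagged <font>
print(map_width_variants(bold_a) == bold_a)       # True: preserved
print(unicodedata.normalize("NFKC", bold_a))      # A: NFKC maps it
```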
+
+4.2.  Differentiation of Subsets
+
+   Especially as existing IETF standards are internationalized, it is
+   necessary to describe collections of characters including especially
+   various subsets of Unicode.  Because Unicode includes ways to code
+   substantially all characters in contemporary use, subsets of the
+   Unicode repertoire can be a useful tool for defining these
+   collections as repertoires independent of specific Unicode coding.
+
+   However specific collections are defined, it is important to remember
+   that, while older CCSs such as ASCII and the ISO 8859 family are
+   close-ended and fixed, Unicode is open-ended, with new character
+   definitions, and often new scripts, being added every year or so.
+   So, while, e.g., an ASCII subset, such as "uppercase letters", can be
+   specified as a range of code points (4/1 to 5/10 for that example),
+   similar definitions for Unicode either have to be specified in terms
+   of Unicode properties or are very dependent on Unicode versions (and
+   the relevant version must be identified in any specification).  See
+   the IDNA code point specification [RFC5892] for an example of
+   specification by combinations of properties.
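+
+   The contrast between a fixed range and a property-based definition
+   can be sketched in Python.  All three helper names are hypothetical,
+   and the property-based result depends on the Unicode version of the
+   interpreter's unicodedata module.
+
```python
import unicodedata


def column_row(ch: str) -> str:
    """Render a code point below 0x100 in column/row notation."""
    cp = ord(ch)
    return f"{cp >> 4}/{cp & 0x0F}"


def is_ascii_uppercase(ch: str) -> bool:
    """Fixed range: 4/1 through 5/10."""
    return 0x41 <= ord(ch) <= 0x5A


def is_uppercase_letter(ch: str) -> bool:
    """Property-based: General Category Lu."""
    return unicodedata.category(ch) == "Lu"


print(column_row("A"), column_row("Z"))   # 4/1 5/10
print(is_ascii_uppercase("\u00c9"))       # False: outside the range
print(is_uppercase_letter("\u00c9"))      # True: has the property
print(unicodedata.unidata_version)        # version this answer relies on
```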
+
+   Some terms are commonly used in the IETF to define character ranges
+   and subsets.  Some of these are imprecise and can cause confusion if
+   not used carefully.
+
+   non-ASCII
+
+      The term "non-ASCII" strictly refers to characters other than
+      those that appear in the ASCII repertoire, independent of the CCS
+      or encoding used for them.  In practice, if a repertoire such as
+      that of Unicode is established as context, "non-ASCII" refers to
+      characters in that repertoire that do not appear in the ASCII
+      repertoire.  "Outside the ASCII repertoire" and "outside the ASCII
+      range" are practical, and more precise, synonyms for "non-ASCII".
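+
+      Because the term concerns the repertoire rather than an
+      encoding, a check can operate on code points directly.  The
+      helper name non_ascii_characters is hypothetical.
+
```python
def non_ascii_characters(text: str) -> list:
    """Return the characters that fall outside the ASCII repertoire."""
    return [ch for ch in text if ord(ch) > 0x7F]


print(non_ascii_characters("na\u00efve caf\u00e9"))  # ['ï', 'é']
print(non_ascii_characters("plain text"))            # []
```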
+
+   letters
+
+      The term "letters" does not have an exact equivalent in the
+      Unicode standard.  Letters are generally characters that are used
+      to write words, but that means very different things in different
+      languages and cultures.
+
+5.  User Interface for Text
+
+   Although the IETF does not standardize user interfaces, many
+   protocols make assumptions about how a user will enter or see text
+   that is used in the protocol.  Internationalization challenges
+   assumptions about the type and limitations of the input and output
+   devices that may be used with applications that use various
+   protocols.  It is therefore useful to consider how users typically
+   interact with text that might contain one or more non-ASCII
+   characters.
+
+   input methods
+
+      An input method is a mechanism for a person to enter text into an
+      application. <RFC6365>
+
+      Text can be entered into a computer in many ways.  Keyboards are
+      by far the most common device used, but many characters cannot be
+      entered on typical computer keyboards in a single stroke.  Many
+      operating systems come with system software that lets users input
+      characters outside the range of what is allowed by keyboards.
+
+      For example, there are dozens of different input methods for Han
+      characters in Chinese, Japanese, and Korean.  Some start with
+      phonetic input through the keyboard, while others use the number
+      of strokes in the character.  Input methods are also needed for
+      scripts that have many diacritics, such as European or Vietnamese
+      characters that have two or three diacritics on a single
+      alphabetic character.
+
+      The term "input method editor" (IME) is often used generically to
+      describe the tools and software used to deal with input of
+      characters on a particular system.
+
+   rendering rules
+
+      A rendering rule is an algorithm that a system uses to decide how
+      to display a string of text. <RFC6365>
+
+      Some scripts can be directly displayed with fonts, where each
+      character from an input stream can simply be copied from a glyph
+      system and put on the screen or printed page.  Other scripts need
+      rules that are based on the context of the characters in order to
+      render text for display.
+
+      Some examples of these rendering rules include:
+
+      *  Scripts such as Arabic (and many others), where the form of the
+         letter changes depending on the adjacent letters, whether the
+         letter is standing alone, at the beginning of a word, in the
+         middle of a word, or at the end of a word.  The rendering rules
+         must choose between two or more glyphs.
+
+      *  Scripts such as the Indic scripts, where consonants may change
+         their form if they are adjacent to certain other consonants or
+         may be displayed in an order different from the way they are
+         stored and pronounced.  The rendering rules must choose between
+         two or more glyphs.
+
+      *  Arabic and Hebrew scripts, where the order of the characters
+         displayed is changed by the bidirectional properties of the
+         alphabetic and other characters and by right-to-left and
+         left-to-right ordering marks.  The rendering rules must choose
+         the order in which characters are displayed.
+
+      *  Some writing systems cannot have their rendering rules suitably
+         defined using mechanisms that are now defined in the Unicode
+         Standard.  None of those writing systems is in active
+         non-scholarly use today.
+
+      *  Many systems use a special rendering rule when they lack a font
+         or other mechanism for rendering a particular character
+         correctly.  That rule typically involves substitution of a
+         small open box or a question mark for the missing character.
+         See "undisplayable character" below.
+
+   graphic symbol
+
+      A graphic symbol is the visual representation of a graphic
+      character or of a composite sequence. <ISOIEC10646>
+
+   font
+
+      A font is a collection of glyphs used for the visual depiction of
+      character data.  A font is often associated with a set of
+      parameters (for example, size, posture, weight, and serifness),
+      which, when set to particular values, generates a collection of
+      imagable glyphs. <UNICODE>
+
+      The term "font" is often used interchangeably with "typeface".  As
+      historically used in typography, a typeface is a family of one or
+      more fonts that share a common general design.  For example,
+      "Times Roman" is actually a typeface, with a collection of fonts
+      such as "Times Roman Bold", "Times Roman Medium", "Times Roman
+      Italic", and so on.  Some sources even consider different type
+      sizes within a typeface to be different fonts.  While those
+      distinctions are rarely important for internationalization
+      purposes, there are exceptions.  Those writing specifications
+      should be very careful about definitions in cases in which the
+      exceptions might lead to ambiguity.
+
+   bidirectional display
+
+      The process or result of mixing left-to-right oriented text and
+      right-to-left oriented text in a single line is called
+      bidirectional display, often abbreviated as "bidi". <UNICODE>
+
+      Most of the world's written languages are displayed left-to-right.
+      However, many widely-used written languages such as ones based on
+      the Hebrew or Arabic scripts are displayed primarily right-to-left
+      (numerals are a common exception in the modern scripts).  Right-
+      to-left text often confuses protocol writers because they have to
+      keep thinking in terms of the order of characters in a string in
+      memory, an order that might be different from what they see on the
+      screen.  (Note that some languages are written both horizontally
+      and vertically and that some historical ones use other display
+      orderings.)
+
+      Further, bidirectional text can cause confusion because there are
+      formatting characters in ISO/IEC 10646 that cause the order of
+      display of text to change.  These explicit formatting characters
+      change the display regardless of the implicit left-to-right or
+      right-to-left properties of characters.  Text that might contain
+      those characters typically requires careful processing before
+      being sorted or compared for equality.
+
+      It is common to see strings with text in both directions, such as
+      strings that include both text and numbers, or strings that
+      contain a mixture of scripts.
+
+      Unicode has a long and incredibly detailed algorithm for
+      displaying bidirectional text [UAX9].
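+
+      The implicit directional properties and the explicit formatting
+      characters can be inspected with Python's unicodedata module.
+      This sketch does not implement the algorithm in [UAX9]; the
+      helper names and the set EXPLICIT_FORMATTING are assumptions
+      made for illustration.
+
```python
import unicodedata

# Bidi classes of the explicit embedding, override, and isolate controls.
EXPLICIT_FORMATTING = {
    "LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI",
}


def bidi_classes(text: str) -> list:
    """Return the bidirectional class of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]


def has_explicit_formatting(text: str) -> bool:
    """True if the text contains an explicit directional control."""
    return any(c in EXPLICIT_FORMATTING for c in bidi_classes(text))


# Latin letter, Hebrew letter, Arabic letter, European digit.
print(bidi_classes("a\u05d0\u06271"))            # ['L', 'R', 'AL', 'EN']
print(has_explicit_formatting("abc"))            # False
print(has_explicit_formatting("a\u202eb"))       # True (U+202E is RLO)
```

      A protocol that compares or sorts such strings might use a check
      like has_explicit_formatting to flag input for further handling.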
+
+   undisplayable character
+
+      A character that has no displayable form. <RFC6365>
+
+      For instance, the zero-width space (U+200B) cannot be displayed
+      because it takes up no horizontal space.  Formatting characters
+      such as those for setting the direction of text are also
+      undisplayable.  Note, however, that every character in [UNICODE]
+      has a glyph associated with it, and that the glyphs for
+      undisplayable characters are enclosed in a dashed square as an
+      indication that the actual character is undisplayable.
+
+      The property of a character that causes it to be undisplayable is
+      intrinsic to its definition.  Undisplayable characters can never
+      be displayed in normal text (the dashed square notation is used
+      only in special circumstances).  Printable characters whose
+      Unicode definitions are associated with glyphs that cannot be
+      rendered on a particular system are not, in this sense,
+      undisplayable.
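+
+      A rough illustration, assuming Python's unicodedata module: the
+      Unicode general category Cf (Format) is used here only as a
+      proxy for the kind of characters described above, not as the
+      definition of "undisplayable".  The helper name is hypothetical.
+
```python
import unicodedata

def is_format_char(ch: str) -> bool:
    """Return True if ch has Unicode general category Cf (Format)."""
    return unicodedata.category(ch) == "Cf"

ZWSP = "\u200b"  # ZERO WIDTH SPACE
RLO = "\u202e"   # RIGHT-TO-LEFT OVERRIDE

# The two strings may look alike when rendered, yet they differ in
# length and do not compare equal.
plain = "ab"
with_zwsp = "a" + ZWSP + "b"

print(is_format_char(ZWSP), is_format_char(RLO), plain == with_zwsp)
```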
+
+   writing style
+
+      Conventions of writing the same script in different styles.
+      <RFC6365>
+
+      Different communities using the script may find text in different
+      writing styles difficult to read and possibly unintelligible.  For
+      example, the Perso-Arabic Nastalique writing style and the Arabic
+      Naskh writing style both use the Arabic script but have very
+      different renderings and are not mutually comprehensible.  Writing
+      styles may have significant impact on internationalization; for
+      example, the Nastalique writing style requires significantly more
+      line height than the Naskh writing style.
+
+6.  Text in Current IETF Protocols
+
+   Many IETF protocols started off being fully internationalized, while
+   others have been internationalized as they were revised.  In this
+   process, IETF members have seen patterns in the way that many
+   protocols use text.  This section describes some specific protocol
+   interactions with text.
+
+   protocol elements
+
+      Protocol elements are uniquely named parts of a protocol.
+      <RFC6365>
+
+      Almost every protocol has named elements, such as "source port" in
+      TCP.  In some protocols, the names of the elements (or text tokens
+      for the names) are transmitted within the protocol.  For example,
+      in SMTP and numerous other IETF protocols, the names of the verbs
+      are part of the command stream.  The names are thus part of the
+      protocol standard.  The names of protocol elements are not
+      normally seen by end users, and it is rarely appropriate to
+      internationalize protocol element names (even while the elements
+      themselves can be internationalized).
+
+   name spaces
+
+      A name space is the set of valid names for a particular item, or
+      the syntactic rules for generating these valid names. <RFC6365>
+
+      Many items in Internet protocols use names to identify specific
+      instances or values.  The names may be generated (by some
+      prescribed rules), registered centrally (e.g., with IANA),
+      or have a distributed registration and control mechanism, such as
+      the names in the DNS.
+
+   on-the-wire encoding
+
+      The encoding and decoding used before and after transmission over
+      the network is often called the "on-the-wire" (or sometimes just
+      "wire") format. <RFC6365>
+
+      Characters are identified by code points.  Before being
+      transmitted in a protocol, they must first be encoded as bits and
+      octets.  Similarly, when characters are received in a
+      transmission, they have been encoded, and a protocol that needs to
+      process the individual characters needs to decode them before
+      processing.
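+
+      The round trip described above can be sketched in Python as
+      follows.  UTF-8 and UTF-16BE are chosen only as examples of
+      encodings; the sample string is an assumption.
+
```python
# Five code points; U+00EF is LATIN SMALL LETTER I WITH DIAERESIS.
text = "na\u00efve"

# Encode to octets before transmission.
wire_utf8 = text.encode("utf-8")         # U+00EF becomes two octets
wire_utf16be = text.encode("utf-16-be")  # two octets per code point here

# A receiver must decode with the same encoding to recover the
# individual characters.
decoded = wire_utf8.decode("utf-8")

print(len(text), len(wire_utf8), len(wire_utf16be), decoded == text)
```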
+
+   parsed text
+
+      Text strings that have been analyzed for subparts. <RFC6365>
+
+      In some protocols, free text in text fields might be parsed.  For
+      example, many mail user agents (MUAs) will parse the words in the
+      text of the Subject: field to attempt to thread based on what
+      appears after the "Re:" prefix.
+
+      Such conventions are very sensitive to localization.  If, for
+      example, a form like "Re:" is altered by an MUA to reflect the
+      language of the sender or recipient, a system that subsequently
+      does threading may not recognize the replacement term as a
+      delimiter string.
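+
+      The fragility can be sketched in Python.  The thread_key helper
+      and its pattern are hypothetical, and "AW:" is used as an
+      example of a localized reply prefix that some mail clients
+      emit; no particular MUA's behavior is implied.
+
```python
import re

# Assumption: a threading heuristic that only knows the "Re:" prefix.
REPLY_PREFIX = re.compile(r"^(?:re:\s*)+", re.IGNORECASE)

def thread_key(subject: str) -> str:
    """Strip leading "Re:" prefixes and fold case to get a thread key."""
    return REPLY_PREFIX.sub("", subject.strip()).casefold()

original = thread_key("Meeting agenda")
english_reply = thread_key("Re: RE: Meeting agenda")
localized_reply = thread_key("AW: Meeting agenda")

# The English reply threads with the original; the localized one
# does not, because its prefix is not recognized.
print(original == english_reply, original == localized_reply)
```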
+
+   charset identification
+
+      Specification of the charset used for a string of text. <RFC6365>
+
+      Protocols that allow more than one charset to be used in the same
+      place should require that the text be identified with the
+      appropriate charset.  Without this identification, a program
+      looking at the text cannot definitively discern the charset of the
+      text.  Charset identification is also called "charset tagging".
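+
+      A short sketch, assuming Python's built-in codecs, of why the
+      tag matters: the same octet decodes to different characters
+      under different charsets.  The two charsets are arbitrary
+      examples.
+
```python
octet = b"\xe9"

# Without a charset tag, both interpretations are equally plausible.
as_latin1 = octet.decode("iso-8859-1")  # LATIN SMALL LETTER E WITH ACUTE
as_cp1251 = octet.decode("cp1251")      # CYRILLIC SMALL LETTER SHORT I

print(hex(ord(as_latin1)), hex(ord(as_cp1251)))
```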
+
+   language identification
+
+      Specification of the human language used for a string of text.
+      <RFC6365>
+
+      Some protocols (such as MIME and HTTP) allow text that is meant
+      for machine processing to be identified with the language used in
+      the text.  Such identification is important for machine processing
+      of the text, such as by systems that render the text by speaking
+      it.  Language identification is also called "language tagging".
+      The IETF "LTRU" standards [RFC5646] and [RFC4647] provide a
+      comprehensive model for language identification.
+
+   MIME
+
+      MIME (Multipurpose Internet Mail Extensions) is a message format
+      that allows for textual message bodies and headers in character
+      sets other than US-ASCII in formats that require ASCII (most
+      notably RFC 5322, the standard for Internet mail headers
+      [RFC5322]).  MIME is described in RFCs 2045 through 2049, as well
+      as more recent RFCs. <RFC6365>
+
+   transfer encoding syntax
+
+      A transfer encoding syntax (TES) (sometimes called a transfer
+      encoding scheme) is a reversible transform of already encoded data
+      that is represented in one or more character encoding schemes.
+      <RFC6365>
+
+      TESs are useful for encoding types of character data into another
+      format, usually for allowing new types of data to be transmitted
+      over legacy protocols.  The main examples of TESs used in the IETF
+      include Base64 and quoted-printable.  MIME identifies the transfer
+      encoding syntax for body parts as a Content-transfer-encoding,
+      occasionally abbreviated C-T-E.
+
+   Base64
+
+      Base64 is a transfer encoding syntax that allows binary data to be
+      represented by the ASCII characters A through Z, a through z, 0
+      through 9, +, /, and =.  It is defined in [RFC2045]. <RFC6365>
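+
+      A minimal sketch using Python's standard base64 module; the
+      input octets (the UTF-8 form of U+00E9) are an arbitrary
+      example.  Two input octets do not fill a three-octet group, so
+      the output ends with one "=" pad character.
+
```python
import base64

data = "\u00e9".encode("utf-8")      # two octets: 0xC3 0xA9
encoded = base64.b64encode(data)     # ASCII-only representation
restored = base64.b64decode(encoded)

print(encoded, restored == data)
```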
+
+   quoted printable
+
+      Quoted printable is a transfer encoding syntax that allows strings
+      that have non-ASCII characters mixed in with mostly ASCII
+      printable characters to be somewhat human readable.  It is
+      defined in [RFC2045]; the closely related "Q" encoding for
+      message header fields is described in [RFC2047]. <RFC6365>
+
+      The quoted printable syntax is generally considered to be a
+      failure at being readable.  It is jokingly referred to as "quoted
+      unreadable".
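+
+      For illustration, Python's standard quopri module can show the
+      effect on a mostly-ASCII string; the sample word is an
+      assumption.  The ASCII letters pass through unchanged while each
+      non-ASCII octet becomes "=" followed by two hexadecimal digits.
+
```python
import quopri

raw = "caf\u00e9".encode("utf-8")    # b"caf" plus octets 0xC3 0xA9
encoded = quopri.encodestring(raw)
restored = quopri.decodestring(encoded)

print(encoded, restored == raw)
```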
+
+   XML
+
+      XML (which is an approximate abbreviation for Extensible Markup
+      Language) is a popular method for structuring text.  XML text that
+      is not encoded as UTF-8 is explicitly tagged with charsets, and
+      all text in XML consists only of Unicode characters.  The
+      specification for XML can be found at <http://www.w3.org/XML/>.
+      <RFC6365>
+
+   ASN.1 text formats
+
+      The ASN.1 data description language has many formats for text
+      data.  The formats allow for different repertoires and different
+      encodings.  Some of the formats that appear in IETF standards
+      based on ASN.1 include IA5String (all ASCII characters),
+      PrintableString (most ASCII characters, but missing many
+      punctuation characters), BMPString (characters from ISO/IEC 10646
+      plane 0 in UTF-16BE format), UTF8String (just as the name
+      implies), and TeletexString (also called T61String).
+
+   ASCII-compatible encoding (ACE)
+
+      Starting in 1996, many ASCII-compatible encoding schemes (which
+      are actually transfer encoding syntaxes) have been proposed as
+      possible solutions for internationalizing host names and some
+      other purposes.  Their goal is to be able to encode any string of
+      ISO/IEC 10646 characters using the preferred syntax for domain
+      names (as described in STD 13).  At the time of this writing, only
+      the ACE produced by Punycode [RFC3492] has become an IETF
+      standard.
+
+      The choice of ACE forms to internationalize legacy protocols must
+      be made with care as it can cause some difficult side effects
+      [RFC6055].
+
+   LDH label
+
+      The classical label form used in the DNS and most applications
+      that call on it, albeit with some additional restrictions,
+      reflects the early syntax of "hostnames" [RFC0952] and limits
+      those names to ASCII letters, digits, and embedded hyphens.  The
+      hostname syntax is identical to that described as the "preferred
+      name syntax" in Section 3.5 of RFC 1034 [RFC1034] as modified by
+      RFC 1123 [RFC1123].  LDH labels are defined in a more restrictive
+      and precise way for internationalization contexts as part of the
+      IDNA2008 specification [RFC5890].
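+
+      A simplified check, sketched in Python.  The pattern encodes an
+      assumption about the classical form (1 to 63 ASCII letters,
+      digits, or hyphens, with no hyphen first or last); it does not
+      implement the finer distinctions that RFC 5890 draws among LDH
+      labels.  The function name is hypothetical.
+
```python
import re

# Assumption: 1 to 63 characters from letters, digits, and hyphen,
# with a letter or digit in the first and last positions.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")

def is_ldh_label(label: str) -> bool:
    """Return True if label matches the simplified LDH pattern."""
    return LDH_LABEL.fullmatch(label) is not None

for candidate in ("example", "xn--bcher-kva", "-bad", "b\u00fccher"):
    print(candidate, is_ldh_label(candidate))
```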
+
+7.  Terms Associated with Internationalized Domain Names
+
+7.1.  IDNA Terminology
+
+   The current specification for Internationalized Domain Names (IDNs),
+   known formally as Internationalized Domain Names for Applications or
+   IDNA, is referred to in the IETF and parts of the broader community
+   as "IDNA2008" and consists of several documents.  Section 2.3 of the
+   first of those documents, commonly known as "IDNA2008 Definitions"
+   [RFC5890] provides definitions and introduces some specialized terms
+   for differentiating among types of DNS labels in an IDN context.
+   Those terms are listed in the table below; see RFC 5890 for the
+   specific definitions if needed.
+
+      ACE Prefix
+      A-label
+      Domain Name Slot
+      IDNA-valid string
+      Internationalized Domain Name (IDN)
+      Internationalized Label
+      LDH Label
+      Non-Reserved LDH label (NR-LDH label)
+      U-label
+
+   Two additional terms entered the IETF's vocabulary as part of the
+   earlier IDN effort [RFC3490] (IDNA2003):
+
+      Stringprep
+
+         Stringprep [RFC3454] provides a model and character tables for
+         preparing and handling internationalized strings.  It was used
+         in the original IDN specification (IDNA2003) via a profile
+         called "Nameprep" [RFC3491].  It is no longer in use in IDNA,
+         but continues to be used in profiles by a number of other
+         protocols. <RFC6365>
+
+      Punycode
+
+         This is the name of the algorithm [RFC3492] used to convert
+         otherwise-valid IDN labels from native-character strings
+         expressed in Unicode to an ASCII-compatible encoding (ACE).
+         Strictly speaking, the term applies to the algorithm only.  In
+         practice, it is widely, if erroneously, used to refer to
+         strings that the algorithm encodes.
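+
+         A minimal sketch using Python's built-in "punycode" codec,
+         which applies the RFC 3492 algorithm only and performs no
+         IDNA validity checks.  The sample label is an assumption, and
+         the "xn--" prefix is prepended by hand to show how the
+         algorithm's output relates to the ACE form of a label.
+
```python
label = "b\u00fccher"                # "bucher" with u-umlaut

# Output of the algorithm alone: ASCII letters, a delimiter, and the
# encoded position and value of the non-ASCII character.
encoded = label.encode("punycode")

# The ACE form of the label carries the ACE prefix in front.
ace_label = b"xn--" + encoded

decoded = encoded.decode("punycode")
print(encoded, ace_label, decoded == label)
```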
+
+7.2.  Character Relationships and Variants
+
+   The term "variant" was introduced into the IETF i18n vocabulary with
+   the JET recommendations [RFC3743].  As used there, it referred
+   strictly to the relationship between Traditional Chinese characters
+   and their Simplified equivalents.  The JET recommendations provided a
+   model for identifying these pairs of characters and labels that used
+   them.  Specific recommendations for variant handling for the Chinese
+   language were provided in a follow-up document [RFC4713].
+
+   In more recent years, the term has also been used to describe other
+   collections of characters or strings that might be perceived as
+   equivalent.  Those collections have involved one or more of several
+   categories of characters and labels containing them including:
+
+   o  "visually similar" or "visually confusable" characters.  These may
+      be limited to characters in different scripts, characters in a
+      single script, or both, and may be those that can appear to be
+      alike even when high-distinguishability reference fonts are used
+      or under various circumstances that may involve malicious choices
+      of typefaces or other ways to trick user perception.  Trivial
+      examples include ASCII "l" and "1" and Latin and Cyrillic "a".
+
+   o  Characters assigned more than one Unicode code point because of
+      some special property.  These characters may be considered "the
+      same" for some purposes and different for others (or by other
+      users).  One of the most commonly cited examples is the Arabic
+      YEH, which is encoded more than once because some of its shapes
+      are different across different languages.  Another example is the
+      pair of Greek lowercase sigma and final sigma: if the latter were
+      viewed purely as a positional presentation variation on the
+      former, it should not have been assigned a separate code point.
+
+   o  Numerals and labels including them.  Unlike letters, the "meaning"
+      of decimal digits is clear and unambiguous regardless of the
+      script with which they are associated.  Some scripts are routinely
+      used almost interchangeably with European digits and digits native
+      to that script.  The Arabic script has two sets of digits
+      (U+0660..U+0669 and U+06F0..U+06F9), written identically for zero
+      through three and seven through nine but differently for four
+      through six; European digits predominate in other areas.
+      Substitution of digits with the same numeric value in labels may
+      give rise to another type of variant.
+
+   o  Orthographic differences within a language.  Many languages have
+      alternate choices of spellings or spellings that differ by locale.
+      Users of those languages generally recognize the spellings as
+      equivalent, at least as much so as the variations described above.
+      Examples include "color" and "colour" in English, German words
+      spelled with o-umlaut or "oe", and so on.  Some of these
+      relationships may also create other types of language-specific
+      perceived differences that do not exist for other languages using
+      the same script.  For example, in Arabic language usage at the end
+      of words, ARABIC LETTER TEH MARBUTA (U+0629) and ARABIC LETTER HEH
+      (U+0647) are differently shaped (one has two dots on top), but
+      they are used interchangeably in writing: they "sound" similar
+      when pronounced at the end of a phrase, and hence the LETTER TEH
+      MARBUTA sometimes is written as LETTER HEH and the two are
+      considered "confusable" in that context.
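+
+   Some of the relationships listed above can be observed with Python's
+   standard unicodedata module.  This is a sketch only: the variable
+   names are assumptions, and none of these checks amounts to a variant
+   policy.
+
```python
import unicodedata

latin_a = "a"            # U+0061 LATIN SMALL LETTER A
cyrillic_a = "\u0430"    # U+0430 CYRILLIC SMALL LETTER A

# Visually confusable, but distinct code points that normalization
# does not unify.
same_after_nfkc = (unicodedata.normalize("NFKC", latin_a)
                   == unicodedata.normalize("NFKC", cyrillic_a))

# Two digit fours from the Arabic script: different code points,
# same numeric value.
arabic_indic_four = "\u0664"
extended_arabic_indic_four = "\u06f4"
values = (unicodedata.digit(arabic_indic_four),
          unicodedata.digit(extended_arabic_indic_four))

# Greek final sigma and lowercase sigma are separate code points
# that case folding maps to the same character.
sigma_fold_equal = "\u03c2".casefold() == "\u03c3".casefold()

print(same_after_nfkc, values, sigma_fold_equal)
```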
+
+   The term "variant" as used in this section should also not be
+   confused with other uses of the term in this document or in Unicode
+   terminology (e.g., those in Section 4.1 above).  If the term is to be
+   used at all, context should clearly distinguish among these different
+   uses and, in particular, between variant characters and variant
+   labels.  Local text should identify which meaning, or combination of
+   meanings, is intended.
+
+8.  Other Common Terms in Internationalization
+
+   This is a hodge-podge of other terms that have appeared in
+   internationalization discussions in the IETF.
+
+   locale
+
+      Locale is the user-specific location and cultural information
+      managed by a computer. <RFC6365>
+
+      Because languages and orthographic conventions differ from country
+      to country (and even region to region within a country), the
+      locale of the user can often be an important factor.  Typically,
+      the locale information for a user includes the language(s) used.
+
+      Locale issues go beyond character use, and can include things such
+      as the display format for currency, dates, and times.  Some
+      locales (especially the popular "C" and "POSIX" locales) do not
+      include language information.
+
+      It should be noted that there are many thorny, unsolved issues
+      with locale.  For example, should text be viewed using the locale
+      information of the person who wrote the text, information that
+      would apply to the location of the system storing or providing the
+      text, or the person viewing it?  What if the person viewing it is
+      traveling to different locations?  Should only some of the locale
+      information affect creation and editing of text?
+
+   Latin characters
+
+      "Latin characters" is a not-precise term for characters
+      historically related to ancient Greek script as modified in the
+      Roman Republic and Empire and currently used throughout the world.
+      <RFC6365>
+
+      The base Latin characters are a subset of the ASCII repertoire and
+      have been augmented by many single and multiple diacritics and
+      quite a few other characters.  ISO/IEC 10646 encodes the Latin
+      characters in ranges including U+0020..U+024F and U+1E00..U+1EFF.
+
+      Because "Latin characters" is used in different contexts to refer
+      to the letters from the ASCII repertoire, the subset of those
+      characters used late in the Roman Republic period, or the
+      different subset used to write Latin in medieval times, the entire
+      ASCII repertoire, all of the code points in the extended Latin
+      script as defined by Unicode, and other collections, the term
+      should be avoided in IETF specifications when possible.
+      Similarly, "Basic Latin" should not be used as a synonym for
+      "ASCII".
+
+   romanization
+
+      The transliteration of a non-Latin script into Latin characters.
+      <RFC6365>
+
+      Because of their widespread use, Latin characters (or graphemes
+      constructed from them) are often used to try to write text in
+      languages that didn't previously have writing systems or whose
+      writing systems were originally based on different scripts.  For
+      example, there are two popular romanizations of Chinese: Wade-
+      Giles and Pinyin, the latter of which is by far more common today.
+      Many romanization systems are inexact and do not give perfect
+      round-trip mappings between the native script and the Latin
+      characters.
+
+   CJK characters and Han characters
+
+      The ideographic characters used in Chinese, Japanese, Korean, and
+      traditional Vietnamese writing systems are often called "CJK
+      characters" after the initial letters of the language names in
+      English.  They are also called "Han characters", after the term in
+      Chinese that is often used for these characters. <RFC6365>
+
+      Note that Han characters do not include the phonetic characters
+      used in the Japanese and Korean languages.  Users of the term "CJK
+      characters" may or may not assume those additional characters are
+      included.
+
+      In ISO/IEC 10646, the Han characters were "unified", meaning that
+      each set of Han characters from Japanese, Chinese, and/or Korean
+      that had the same origin was assigned a single code point.  The
+      positive result of this was that many fewer code points were
+      needed to represent Han; the negative result of this was that
+      characters that people who write the three languages think are
+      different have the same code point.  There is a great deal of
+      disagreement on the nature, the origin, and the severity of the
+      problems caused by Han unification.
+
+   translation
+
+      The process of conveying the meaning of some passage of text in
+      one language, so that it can be expressed equivalently in another
+      language. <RFC6365>
+
+      Many language translation systems are inexact and cannot be
+      applied repeatedly to go from one language to another to another.
+
+   transliteration
+
+      The process of representing the characters of an alphabetical or
+      syllabic system of writing by the characters of a conversion
+      alphabet. <RFC6365>
+
+      Many script transliterations are exact, and many have perfect
+      round-trip mappings.  The notable exception to this is
+      romanization, described above.  Transliteration involves
+      converting text expressed in one script into another script,
+      generally on a letter-by-letter basis.  There are many official
+      and unofficial transliteration standards, most notably those from
+      ISO TC 46 and the U.S. Library of Congress.
+
+   transcription
+
+      The process of systematically writing the sounds of some passage
+      of spoken language, generally with the use of a technical phonetic
+      alphabet (usually Latin-based) or other systematic transcriptional
+      orthography.  Transcription also sometimes refers to the
+      conversion of written text into a transcribed form, based on the
+      sound of the text as if it had been spoken. <RFC6365>
+
+      Unlike transliterations, which are generally designed to be round-
+      trip convertible, transcriptions of written material are almost
+      never round-trip convertible to their original form, at least
+      without some supplemental information.
+
+   regular expressions
+
+      Regular expressions provide a mechanism to select specific strings
+      from a set of character strings.  Regular expressions are a
+      language used to search for text within strings, and possibly
+      modify the text found with other text. <RFC6365>
+
+      Pattern matching for text involves being able to represent one or
+      more code points in an abstract notation, such as searching for
+      all capital Latin letters or all punctuation.  The most common
+      mechanism in IETF protocols for naming such patterns is the use of
+      regular expressions.  There is no single regular expression
+      language, but there are numerous very similar dialects that are
+      not quite consistent with each other.
+
+      The Unicode Consortium has a good discussion about how to adapt
+      regular expression engines to use Unicode [UTR18].
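+
+      A small sketch of dialect-dependent behavior, using Python's
+      standard re module as one example dialect.  The sample text is
+      an assumption.  The same pattern selects different strings
+      depending on whether the engine treats classes such as \w as
+      ASCII-only or Unicode-aware.
+
```python
import re

text = "\u00c9cole ABC"   # starts with LATIN CAPITAL LETTER E WITH ACUTE

# An explicit ASCII range misses the non-ASCII capital letter.
ascii_caps = re.findall(r"[A-Z]", text)

# Python's str.isupper() consults Unicode properties instead.
unicode_caps = [ch for ch in text if ch.isupper()]

# \w is Unicode-aware for str patterns by default in this dialect,
# and ASCII-only when re.ASCII is set.
words_unicode = re.findall(r"\w+", text)
words_ascii = re.findall(r"\w+", text, flags=re.ASCII)

print(ascii_caps, unicode_caps, words_unicode, words_ascii)
```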
+
+   private use character
+
+      ISO/IEC 10646 code points from U+E000 to U+F8FF, U+F0000 to
+      U+FFFFD, and U+100000 to U+10FFFD are available for private use.
+      This refers to code points of the standard whose interpretation is
+      not specified by the standard and whose use may be determined by
+      private agreement among cooperating users. <UNICODE>
+
+      The use of these "private use" characters is defined by the
+      parties who transmit and receive them, and is thus not appropriate
+      for standardization.  (The IETF has a long history of private use
+      names for things such as "x-" names in MIME types, charsets, and
+      languages.  Most of the experience with these has been quite
+      negative, with many implementors assuming that private use names
+      are in fact public and long-lived.)
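+
+      The ranges given above can be checked directly, as in this
+      Python sketch.  The constant and function names are
+      hypothetical; the cross-check relies on the Unicode general
+      category Co (Private Use) reported by the unicodedata module.
+
```python
import unicodedata

# The three private use ranges listed in the definition above.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)

def is_private_use(ch: str) -> bool:
    """Return True if ch falls in one of the private use ranges."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in PRIVATE_USE_RANGES)

# The range test agrees with general category Co at the boundaries.
boundaries = [chr(cp) for pair in PRIVATE_USE_RANGES for cp in pair]
agrees = all(
    is_private_use(ch) and unicodedata.category(ch) == "Co"
    for ch in boundaries
)
print(agrees, is_private_use("A"))
```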
+
+9.  Security Considerations
+
+   Security is not discussed directly in this document.  While the
+   definitions here have no direct effect on security, they are used in
+   many security contexts.  For example, authentication usually involves
+   comparing two tokens, and one or both of those tokens might be text;
+   thus, some methods of comparison might involve using some of the
+   internationalization concepts for which terms are defined in this
+   document.
+
+   Having said that, other RFCs dealing with internationalization have
+   security consideration descriptions that may be useful to the reader
+   of this document.  In particular, the security considerations in RFC
+   3454, RFC 3629, RFC 4013 [RFC4013], and RFC 5890 go into a fair
+   amount of detail.
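+
+   As an illustration of the token comparison mentioned above, the
+   following Python sketch normalizes two text tokens before comparing
+   them.  The choice of NFC and the function name are assumptions made
+   for the example; a real protocol would follow whatever preparation
+   rules its own specification requires.
+
```python
import hmac
import unicodedata

def tokens_match(a: str, b: str) -> bool:
    """Compare two text tokens after NFC normalization (an assumption)."""
    a_octets = unicodedata.normalize("NFC", a).encode("utf-8")
    b_octets = unicodedata.normalize("NFC", b).encode("utf-8")
    # compare_digest avoids leaking the position of the first difference.
    return hmac.compare_digest(a_octets, b_octets)

precomposed = "caf\u00e9"    # e with acute as a single code point
decomposed = "cafe\u0301"    # e followed by COMBINING ACUTE ACCENT

# The raw strings differ, but the normalized tokens match.
print(precomposed == decomposed, tokens_match(precomposed, decomposed))
```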
+
+10.  References
+
+10.1.  Normative References
+
+   [ISOIEC10646]   ISO/IEC, "ISO/IEC 10646:2011.  International Standard
+                   -- Information technology - Universal Multiple-Octet
+                   Coded Character Set (UCS)", 2011.
+
+   [RFC2047]       Moore, K., "MIME (Multipurpose Internet Mail
+                   Extensions) Part Three: Message Header Extensions for
+                   Non-ASCII Text", RFC 2047, November 1996.
+
+   [UNICODE]       The Unicode Consortium, "The Unicode Standard,
+                   Version 6.0", (Mountain View, CA: The Unicode
+                   Consortium, 2011. ISBN 978-1-936213-01-6).
+                   <http://www.unicode.org/versions/Unicode6.0.0/>.
+
+10.2.  Informative References
+
+   [CHARMOD]       W3C, "Character Model for the World Wide Web 1.0",
+                   2005, <http://www.w3.org/TR/charmod/>.
+
+   [FRAMEWORK]     ISO/IEC, "ISO/IEC TR 11017:1997(E).  Information
+                   technology - Framework for internationalization,
+                   prepared by ISO/IEC JTC 1/SC 22/WG 20", 1997.
+
+   [ISO3166]       ISO, "ISO 3166-1:2006 - Codes for the representation
+                   of names of countries and their subdivisions -- Part
+                   1: Country codes", 2006.
+
+   [ISO639]        ISO, "ISO 639-1:2002 - Code for the representation of
+                   names of languages - Part 1: Alpha-2 code", 2002.
+
+   [ISO6429]       ISO/IEC, "ISO/IEC 6429:1992.  Information technology
+                   -- Control functions for coded character sets", 1992.
+
+   [RFC0952]       Harrenstien, K., Stahl, M., and E. Feinler, "DoD
+                   Internet host table specification", RFC 952,
+                   October 1985.
+
+   [RFC1034]       Mockapetris, P., "Domain names - concepts and
+                   facilities", STD 13, RFC 1034, November 1987.
+
+   [RFC1123]       Braden, R., "Requirements for Internet Hosts -
+                   Application and Support", STD 3, RFC 1123,
+                   October 1989.
+
+   [RFC2045]       Freed, N. and N. Borenstein, "Multipurpose Internet
+                   Mail Extensions (MIME) Part One: Format of Internet
+                   Message Bodies", RFC 2045, November 1996.
+
+   [RFC2119]       Bradner, S., "Key words for use in RFCs to Indicate
+                   Requirement Levels", BCP 14, RFC 2119, March 1997.
+
+   [RFC2277]       Alvestrand, H., "IETF Policy on Character Sets and
+                   Languages", BCP 18, RFC 2277, January 1998.
+
+   [RFC2781]       Hoffman, P. and F. Yergeau, "UTF-16, an encoding of
+                   ISO 10646", RFC 2781, February 2000.
+
+   [RFC2978]       Freed, N. and J. Postel, "IANA Charset Registration
+                   Procedures", BCP 19, RFC 2978, October 2000.
+
+   [RFC3454]       Hoffman, P. and M. Blanchet, "Preparation of
+                   Internationalized Strings ("stringprep")", RFC 3454,
+                   December 2002.
+
+   [RFC3490]       Faltstrom, P., Hoffman, P., and A. Costello,
+                   "Internationalizing Domain Names in Applications
+                   (IDNA)", RFC 3490, March 2003.
+
+   [RFC3491]       Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
+                   Profile for Internationalized Domain Names (IDN)",
+                   RFC 3491, March 2003.
+
+   [RFC3492]       Costello, A., "Punycode: A Bootstring encoding of
+                   Unicode for Internationalized Domain Names in
+                   Applications (IDNA)", RFC 3492, March 2003.
+
+   [RFC3629]       Yergeau, F., "UTF-8, a transformation format of ISO
+                   10646", STD 63, RFC 3629, November 2003.
+
+   [RFC3743]       Konishi, K., Huang, K., Qian, H., and Y. Ko, "Joint
+                   Engineering Team (JET) Guidelines for
+                   Internationalized Domain Names (IDN) Registration and
+                   Administration for Chinese, Japanese, and Korean",
+                   RFC 3743, April 2004.
+
+   [RFC4013]       Zeilenga, K., "SASLprep: Stringprep Profile for User
+                   Names and Passwords", RFC 4013, February 2005.
+
+   [RFC4647]       Phillips, A. and M. Davis, "Matching of Language
+                   Tags", BCP 47, RFC 4647, September 2006.
+
+   [RFC4713]       Lee, X., Mao, W., Chen, E., Hsu, N., and J. Klensin,
+                   "Registration and Administration Recommendations for
+                   Chinese Domain Names", RFC 4713, October 2006.
+
+   [RFC5137]       Klensin, J., "ASCII Escaping of Unicode Characters",
+                   BCP 137, RFC 5137, February 2008.
+
+   [RFC5198]       Klensin, J. and M. Padlipsky, "Unicode Format for
+                   Network Interchange", RFC 5198, March 2008.
+
+   [RFC5322]       Resnick, P., Ed., "Internet Message Format",
+                   RFC 5322, October 2008.
+
+   [RFC5646]       Phillips, A. and M. Davis, "Tags for Identifying
+                   Languages", BCP 47, RFC 5646, September 2009.
+
+   [RFC5890]       Klensin, J., "Internationalized Domain Names for
+                   Applications (IDNA): Definitions and Document
+                   Framework", RFC 5890, August 2010.
+
+   [RFC5892]       Faltstrom, P., "The Unicode Code Points and
+                   Internationalized Domain Names for Applications
+                   (IDNA)", RFC 5892, August 2010.
+
+   [RFC5895]       Resnick, P. and P. Hoffman, "Mapping Characters for
+                   Internationalized Domain Names in Applications (IDNA)
+                   2008", RFC 5895, September 2010.
+
+   [RFC6055]       Thaler, D., Klensin, J., and S. Cheshire, "IAB
+                   Thoughts on Encodings for Internationalized Domain
+                   Names", RFC 6055, February 2011.
+
+   [UAX34]         The Unicode Consortium, "Unicode Standard Annex #34:
+                   Unicode Named Character Sequences", 2010,
+                   <http://www.unicode.org/reports/tr34>.
+
+   [UAX9]          The Unicode Consortium, "Unicode Standard Annex #9:
+                   Unicode Bidirectional Algorithm", 2010,
+                   <http://www.unicode.org/reports/tr9>.
+
+   [US-ASCII]      ANSI, "Coded Character Set -- 7-bit American Standard
+                   Code for Information Interchange, ANSI X3.4-1986",
+                   1986.
+
+   [UTN6]          The Unicode Consortium, "Unicode Technical Note #6:
+                   BOCU-1: MIME-Compatible Unicode Compression", 2006,
+                   <http://www.unicode.org/notes/tn6/>.
+
+   [UTR15]         The Unicode Consortium, "Unicode Standard Annex #15:
+                   Unicode Normalization Forms", 2010,
+                   <http://www.unicode.org/reports/tr15>.
+
+   [UTR18]         The Unicode Consortium, "Unicode Technical Standard
+                   #18: Unicode Regular Expressions", 2008,
+                   <http://www.unicode.org/reports/tr18>.
+
+   [UTR22]         The Unicode Consortium, "Unicode Technical Standard
+                   #22: Unicode Character Mapping Markup Language",
+                   2009, <http://www.unicode.org/reports/tr22>.
+
+   [UTR6]          The Unicode Consortium, "Unicode Technical Standard
+                   #6: A Standard Compression Scheme for Unicode", 2005,
+                   <http://www.unicode.org/reports/tr6>.
+
+   [W3C-i18n-Def]  W3C, "Localization vs. Internationalization",
+                   September 2010, <http://www.w3.org/International/
+                   questions/qa-i18n.en>.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Hoffman & Klensin         Best Current Practice                [Page 40]
+
+RFC 6365            Internationalization Terminology      September 2011
+
+
+Appendix A.  Additional Interesting Reading
+
+   Barry, Randall, ed.  ALA-LC Romanization Tables.  Washington: U.S.
+   Library of Congress, 1997.  ISBN 0844409405
+
+   Coulmas, Florian.  Blackwell Encyclopedia of Writing Systems.
+   Oxford: Blackwell Publishers, 1999.  ISBN 063121481X
+
+   Dalby, Andrew.  Dictionary of Languages: The Definitive Reference to
+   More than 400 Languages.  New York: Columbia University Press, 2004.
+   ISBN 978-0231115698
+
+   Daniels, Peter, and William Bright.  The World's Writing Systems.
+   New York: Oxford University Press, 1996.  ISBN 0195079930
+
+   DeFrancis, John.  The Chinese Language: Fact and Fantasy.  Honolulu:
+   University of Hawaii Press, 1984.  ISBN 0-8284-085505 and
+   0-8248-1058-6
+
+   Drucker, Joanna.  The Alphabetic Labyrinth: The Letters in History
+   and Imagination.  London: Thames & Hudson, 1995.  ISBN 0-500-28068-1
+
+   Fazzioli, Edoardo.  Chinese Calligraphy.  New York: Abbeville Press,
+   1986, 1987 (English translation).  ISBN 0-89659-774-1
+
+   Hooker, J.T., et al.  Reading the Past: Ancient Writing from
+   Cuneiform to the Alphabet.  London: British Museum Press, 1990.  ISBN
+   0-7141-8077-7
+
+   Lunde, Ken.  CJKV Information Processing.  Sebastopol, CA: O'Reilly &
+   Assoc., 1999.  ISBN 1-56592-224-7
+
+   Nakanishi, Akira.  Writing Systems of the World.  Rutland, VT:
+   Charles E. Tuttle Company, 1980.  ISBN 0804816549
+
+   Robinson, Andrew.  The Story of Writing: Alphabets, Hieroglyphs, &
+   Pictograms.  London: Thames & Hudson, 1995, 2000.  ISBN 0-500-28156-4
+
+   Sacks, David.  Language Visible.  New York: Broadway Books (a
+   division of Random House, Inc.), 2003.  ISBN 0-7679-1172-5
+
+
+
+
+
+
+
+
+
+
+
+Hoffman & Klensin         Best Current Practice                [Page 41]
+
+RFC 6365            Internationalization Terminology      September 2011
+
+
+Appendix B.  Acknowledgements
+
+   The definitions in this document come from many sources, including a
+   wide variety of IETF documents.
+
+   James Seng contributed to the initial outline of RFC 3536.  Harald
+   Alvestrand and Martin Duerst made extensive useful comments on early
+   versions.  Others who contributed to the development of RFC 3536
+   include Dan Kohn, Jacob Palme, Johan van Wingen, Peter Constable,
+   Yuri Demchenko, Susan Harris, Zita Wenzel, John Klensin, Henning
+   Schulzrinne, Leslie Daigle, Markus Scherer, and Ken Whistler.
+
+   Abdulaziz Al-Zoman, Tim Bray, Frank Ellermann, Antonio Marko, JFC
+   Morphin, Sarmad Hussain, Mykyta Yevstifeyev, Ken Whistler, and others
+   identified important issues with, or made specific suggestions for,
+   this new version.
+
+Appendix C.  Significant Changes from RFC 3536
+
+   This document mostly consists of additions to RFC 3536.  The
+   following is a list of the most significant changes.
+
+   o  Changed the document's status to BCP.
+
+   o  Commonly used synonyms were added to several descriptions and
+      indexed.
+
+   o  A list of terms defined and used in IDNA2008 was added, with a
+      pointer to RFC 5890.  Those definitions have not been repeated in
+      this document.
+
+   o  The much-abused term "variant" is now discussed in some detail.
+
+   o  A discussion of different subsets of the Unicode repertoire was
+      added as Section 4.2 and associated definitions were included.
+
+   o  Added a new term, "writing style".
+
+   o  Discussions of case-folding and mapping were expanded.
+
+   o  Minor edits were made to some section titles and a number of other
+      editorial improvements were made.
+
+   o  The discussion of control codes was updated to include additional
+      information and clarify that "control code" and "control
+      character" are synonyms.
+
+   o  Many terms were clarified to reflect contemporary usage.
+
+
+
+
+Hoffman & Klensin         Best Current Practice                [Page 42]
+
+RFC 6365            Internationalization Terminology      September 2011
+
+
+   o  The index to terms by section in RFC 3536 was replaced by an index
+      to pages containing considerably more terms.
+
+   o  The acknowledgments were updated.
+
+   o  Some of the references were updated.
+
+   o  The supplemental reading list was expanded somewhat.
+
+Index
+
+   A
+      A-label  31
+      ACE  30, 31
+      ACE Prefix  31
+      alphabetic  20
+      ANSI  13
+      ASCII  15
+      ASCII-compatible encoding  30, 31
+      ASN.1 text formats  30
+
+   B
+      Base64  29
+      Basic Multilingual Plane  13
+      bidi  26
+      bidirectional display  26
+      BMP  13
+      BMPString  30
+      BOCU-1  14
+      BOM  14
+      byte order mark  14
+
+   C
+      C-T-E  29
+      case  18
+      CCS  7
+      CEN/ISSS  13
+      character  6
+      character encoding form  7
+      character encoding scheme  8
+      character repertoire  7
+      charset  8
+      charset identification  28
+      CJK characters  34
+      code chart  19
+      code point  16
+      code table  19
+      coded character  6
+
+
+
+Hoffman & Klensin         Best Current Practice                [Page 43]
+
+RFC 6365            Internationalization Terminology      September 2011
+
+
+      coded character set  7
+      collation  18
+      combining character  16
+      combining character sequence  16
+      compatibility character  22
+      compatibility variant  22
+      composite sequence  16
+      content-transfer-encoding  29
+      control character  21
+      control code  21
+      control sequence  22
+
+   D
+      decomposed character  16
+      diacritic  21
+      displaying and rendering text  10
+      Domain Name Slot  31
+
+   E
+      encoding forms  13
+
+   F
+      font  25
+      formatting character  22
+
+   G
+      glyph  7
+      glyph code  7
+      graphic symbol  25
+
+   H
+      Han characters  34
+
+   I
+      i18n  9
+      IA5String  30
+      ideographic  20
+      IDN  31
+      IDNA  31
+      IDNA-valid string  31
+      IDNA2003  31
+      IDNA2008  31
+      IME  24
+      input method editor  24
+      input methods  24
+      internationalization  8
+      Internationalized Domain Name  31
+      Internationalized Label  31
+
+
+
+Hoffman & Klensin         Best Current Practice                [Page 44]
+
+RFC 6365            Internationalization Terminology      September 2011
+
+
+      ISO  11
+      ISO 639  11
+      ISO 3166  11
+      ISO 8859  15
+      ISO TC 46  11
+
+   J
+      JIS  13
+      JTC 1  11
+
+   L
+      l10n  9
+      language  5
+      language identification  29
+      Latin characters  34
+      LDH Label  30
+      letters  23
+      Local and regional standards organizations  13
+      locale  33
+      localization  9
+
+   M
+      MIME  29
+      multilingual  10
+
+   N
+      name spaces  28
+      Nameprep  31
+      NFC  17
+      NFD  17
+      NFKC  17
+      NFKD  17
+      non-ASCII  23
+      nonspacing character  21
+      normalization  17
+      NR-LDH label  31
+      NVT  15
+
+   O
+      on-the-wire encoding  28
+
+
+
+
+
+
+
+
+
+
+
+Hoffman & Klensin         Best Current Practice                [Page 45]
+
+RFC 6365            Internationalization Terminology      September 2011
+
+
+   P
+      parsed text  28
+      precomposed character  16
+      PrintableString  30
+      private use character  36
+      protocol elements  27
+      punctuation  21
+      Punycode  30, 31
+
+   Q
+      quoted-printable  29
+
+   R
+      regular expressions  36
+      rendering rules  24
+      repertoire  7
+      romanization  34
+
+   S
+      SAC  13
+      script  5
+      SCSU  14
+      sorting  18
+      Stringprep  31
+      surrogate pair  14
+      symbol  21
+
+   T
+      T61String  30
+      TeletexString  30
+      TES  29
+      transcoding  7
+      transcription  35
+      transfer encoding syntax  29
+      transformation formats  13
+      translation  35
+      transliteration  34, 35
+      typeface  25
+
+   U
+      U-label  31
+      UCS-2  13
+      UCS-4  13
+      undisplayable character  26
+      Unicode Consortium  12
+      US-ASCII  15
+      UTC  12
+      UTF-8  14
+
+
+
+Hoffman & Klensin         Best Current Practice                [Page 46]
+
+RFC 6365            Internationalization Terminology      September 2011
+
+
+      UTF-16  14
+      UTF-16BE  14
+      UTF-16LE  14
+      UTF-32  14
+      UTF8String  30
+
+   V
+      variant  32
+
+   W
+      W3C  13
+      World Wide Web Consortium  13
+      writing style  27
+      writing system  6
+
+   X
+      XML  13, 30
+
+Authors' Addresses
+
+   Paul Hoffman
+   VPN Consortium
+
+   EMail: paul.hoffman@vpnc.org
+
+
+   John C Klensin
+   1770 Massachusetts Ave, Ste 322
+   Cambridge, MA  02140
+   USA
+
+   Phone: +1 617 245 1457
+   EMail: john+ietf@jck.com
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Hoffman & Klensin         Best Current Practice                [Page 47]
+