From 4bfd864f10b68b71482b35c818559068ef8d5797 Mon Sep 17 00:00:00 2001 From: Thomas Voss Date: Wed, 27 Nov 2024 20:54:24 +0100 Subject: doc: Add RFC documents --- doc/rfc/rfc6497.txt | 843 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 843 insertions(+) create mode 100644 doc/rfc/rfc6497.txt (limited to 'doc/rfc/rfc6497.txt') diff --git a/doc/rfc/rfc6497.txt b/doc/rfc/rfc6497.txt new file mode 100644 index 0000000..a73c143 --- /dev/null +++ b/doc/rfc/rfc6497.txt @@ -0,0 +1,843 @@ + + + + + + +Internet Engineering Task Force (IETF) M. Davis +Request for Comments: 6497 Google +Category: Informational A. Phillips +ISSN: 2070-1721 Lab126 + Y. Umaoka + IBM + C. Falk + Infinite Automata + February 2012 + + + BCP 47 Extension T - Transformed Content + +Abstract + + This document specifies an Extension to BCP 47 that provides subtags + for specifying the source language or script of transformed content, + including content that has been transliterated, transcribed, or + translated, or in some other way influenced by the source. It also + provides for additional information used for identification. + +Status of This Memo + + This document is not an Internet Standards Track specification; it is + published for informational purposes. + + This document is a product of the Internet Engineering Task Force + (IETF). It represents the consensus of the IETF community. It has + received public review and has been approved for publication by the + Internet Engineering Steering Group (IESG). Not all documents + approved by the IESG are a candidate for any level of Internet + Standard; see Section 2 of RFC 5741. + + Information about the current status of this document, any errata, + and how to provide feedback on it may be obtained at + http://www.rfc-editor.org/info/rfc6497. + + + + + + + + + + + + + + + +Davis, et al. 
Informational [Page 1] + +RFC 6497 BCP 47 Extension T February 2012 + + +Copyright Notice + + Copyright (c) 2012 IETF Trust and the persons identified as the + document authors. All rights reserved. + + This document is subject to BCP 78 and the IETF Trust's Legal + Provisions Relating to IETF Documents + (http://trustee.ietf.org/license-info) in effect on the date of + publication of this document. Please review these documents + carefully, as they describe your rights and restrictions with respect + to this document. Code Components extracted from this document must + include Simplified BSD License text as described in Section 4.e of + the Trust Legal Provisions and are provided without warranty as + described in the Simplified BSD License. + +Table of Contents + + 1. Introduction ....................................................2 + 1.1. Requirements Language ......................................4 + 2. BCP 47 Required Information .....................................4 + 2.1. Overview ...................................................4 + 2.2. Structure ..................................................6 + 2.3. Canonicalization ...........................................7 + 2.4. BCP 47 Registration Form ...................................8 + 2.5. Field Definitions ..........................................8 + 2.6. Registration of Field Subtags .............................10 + 2.7. Registration of Additional Fields .........................11 + 2.8. Committee Responses to Registration Proposals .............11 + 2.9. Machine-Readable Data .....................................11 + 3. Acknowledgements ...............................................14 + 4. IANA Considerations ............................................14 + 5. Security Considerations ........................................14 + 6. References .....................................................14 + 6.1. Normative References ......................................14 + 6.2. 
Informative References ....................................15 + +1. Introduction + + [BCP47] permits the definition and registration of language tag + extensions "that contain a language component and are compatible with + applications that understand language tags". This document defines + an extension for specifying the source of content that has been + transformed, including text that has been transliterated, + transcribed, or translated, or in some other way influenced by the + source. It may be used in queries to request content that has been + transformed. The "singleton" identifier for this extension is 't'. + + + + + +Davis, et al. Informational [Page 2] + +RFC 6497 BCP 47 Extension T February 2012 + + + Language tags, as defined by [BCP47], are useful for identifying the + language of content. There are mechanisms for specifying variant + subtags for special purposes. However, these variants are + insufficient for specifying content that has undergone + transformations, including content that has been transliterated, + transcribed, or translated. The correct interpretation of the + content may depend upon knowledge of the conventions used for the + transformation. + + Suppose that Italian or Russian cities on a map are transcribed for + Japanese users. Each name needs to be transliterated into katakana + using rules appropriate for the specific source and target language. + When tagging such data, it is important to be able to indicate not + only the resulting content language ("ja" in this case), but also the + source language. + + Transforms such as transliterations may vary, depending not only on + the basis of the source and target script, but also on the source and + target language. Thus, the Russian <U+041F U+0443 U+0442 U+0438 U+043D> (which corresponds to the Cyrillic <PE, U, TE, I, EN>) + transliterates into "Putin" in English but "Poutine" in French. 
The + identifier could be used to indicate a desired mechanical + transformation in an API, or could be used to tag data that has been + converted (mechanically or by hand) according to a transliteration + method. + + In addition, many different conventions have arisen for how to + transform text, even between the same languages and scripts. For + example, "Gaddafi" is commonly transliterated from Arabic to English + as any of (G/Q/K/Kh)a(d/dh/dd/dhdh/th/zz)af(i/y). Some examples of + standardized conventions used for transcribing or transliterating + text include: + + a. United Nations Group of Experts on Geographical Names (UNGEGN) + + b. US Library of Congress (LOC) + + c. US Board on Geographic Names (BGN) + + d. Korean Ministry of Culture, Sports and Tourism (MCST) + + e. International Organization for Standardization (ISO) + + The usage of this extension is not limited to formal transformations, + and may include other instances where the content is in some other + way influenced by the source. For example, this extension could be + used to designate a request for a speech recognizer that is tailored + + + + +Davis, et al. Informational [Page 3] + +RFC 6497 BCP 47 Extension T February 2012 + + + specifically for second-language speakers who are first-language + speakers of a particular language (e.g., a recognizer for "English + spoken with a Chinese accent"). + +1.1. Requirements Language + + The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", + "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this + document are to be interpreted as described in RFC 2119 [RFC2119]. + +2. BCP 47 Required Information + +2.1. Overview + + Identification of transformed content can be done using the 't' + extension defined in this document. This extension is formed by the + 't' singleton followed by a sequence of subtags that would form a + language tag as defined by [BCP47]. 
This allows the source language + or script to be specified to the degree of precision required. There + are restrictions on the sequence of subtags. They MUST form a + regular, valid, canonical language tag, and MUST neither include + extensions nor private use sequences introduced by the singleton 'x'. + Where only the script is relevant (such as identifying a script- + script transliteration), then 'und' is used for the primary language + subtag. + + For example: + + +---------------------+---------------------------------------------+ + | Language Tag | Description | + +---------------------+---------------------------------------------+ + | ja-t-it | The content is Japanese, transformed from | + | | Italian. | + | ja-Kana-t-it | The content is Japanese Katakana, | + | | transformed from Italian. | + | und-Latn-t-und-cyrl | The content is in the Latin script, | + | | transformed from the Cyrillic script. | + +---------------------+---------------------------------------------+ + + Note that the sequence of subtags governed by 't' cannot contain a + singleton (a single-character subtag), because that would start a new + extension. For example, the tag "ja-t-i-ami" does not indicate that + the source is in "i-ami", because "i-ami" is not a regular language + tag in [BCP47]. That tag would express an empty 't' extension + followed by an 'i' extension. + + + + + + +Davis, et al. Informational [Page 4] + +RFC 6497 BCP 47 Extension T February 2012 + + + The 't' extension is not intended for use in structured data that + already provides separate source and target language identifiers. + For example, this is the case in localization interchange formats + such as XLIFF. In such cases, it would be inappropriate to use + "ja-t-it" for the target language tag because the source language tag + "it" would already be present in the data. Instead, one would use + the language tag "ja". 
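As an illustration only (it is not part of the RFC text), the Overview rules above can be sketched in Python. The sketch splits a tag at the 't' singleton and collects the subtags that follow it, up to the next single-character subtag. The function name `split_t_extension` is hypothetical. The sketch assumes a syntactically well-formed tag and does not check that the collected subtags form a valid, canonical language tag, as the text requires.

```python
def split_t_extension(tag):
    """Split a language tag at the 't' singleton (illustrative sketch only).

    Returns (prefix, t_subtags):
      prefix    -- everything before the 't' singleton, joined with '-'
      t_subtags -- list of subtags governed by 't'; None if there is no
                   't' extension; [] if 't' is followed directly by
                   another singleton (an empty 't' extension)
    """
    subtags = tag.split("-")
    # Start at index 1: the first subtag is the primary language subtag.
    for i in range(1, len(subtags)):
        current = subtags[i].lower()  # subtags are not case sensitive
        if current == "x":
            # Everything after 'x' is private use, so no 't' extension
            # can start there.
            break
        if current == "t":
            governed = []
            for sub in subtags[i + 1:]:
                if len(sub) == 1:
                    # A single-character subtag starts a new extension.
                    break
                governed.append(sub)
            return "-".join(subtags[:i]), governed
    return tag, None


print(split_t_extension("ja-Kana-t-it"))         # ('ja-Kana', ['it'])
print(split_t_extension("und-Latn-t-und-cyrl"))  # ('und-Latn', ['und', 'cyrl'])
print(split_t_extension("ja-t-i-ami"))           # ('ja', [])
```

The last call mirrors the "ja-t-i-ami" case discussed above: the 't' extension comes back empty because 'i' is read as the start of another extension.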
+ + As noted earlier, it is sometimes necessary to indicate additional + information about a transformation. This additional information is + optionally supplied after the source in a series of one or more + fields, where each field consists of a field separator subtag + followed by one or more non-separator subtags. Each field separator + subtag consists of a single letter followed by a single digit. + + A transformation mechanism is an optional field that indicates the + specification used for the transformation, such as "UNGEGN" for the + United Nations Group of Experts on Geographical Names + transliterations and transcriptions. It uses the 'm0' field + separator followed by certain subtags. + + For example: + + +------------------------------------+------------------------------+ + | Language Tag | Description | + +------------------------------------+------------------------------+ + | und-Cyrl-t-und-latn-m0-ungegn-2007 | The content is in Cyrillic, | + | | transformed from Latin, | + | | according to a UNGEGN | + | | specification dated 2007. | + +------------------------------------+------------------------------+ + + The field separator subtags, such as 'm0', were chosen because they + are short, visually distinctive, and cannot occur in a language + subtag (outside of an extension and after 'x'), thus eliminating the + potential for collision or confusion with the source language tag. + + The field subtags are defined by Section 3 of Unicode Technical + Standard #35: Unicode Locale Data Markup Language (LDML) [UTS35], the + main specification for the Unicode Common Locale Data Repository + (CLDR) project. That section also defines the parallel 'u' extension + [RFC6067], for which the Unicode Consortium is also the maintaining + authority. 
As required by BCP 47, subtags follow the language tag + ABNF and other rules for the formation of language tags and subtags, + are restricted to the ASCII letters and digits, are not case + sensitive, and do not exceed eight characters in length. + + + + + +Davis, et al. Informational [Page 5] + +RFC 6497 BCP 47 Extension T February 2012 + + + The LDML specification is available over the Internet and at no cost, + and is available via a royalty-free license at + http://unicode.org/copyright.html. LDML is versioned, and each + version of LDML is numbered, dated, and stable. Extension subtags, + once defined by LDML, are never retracted or substantially changed in + meaning. + + The maintaining authority for the 't' extension is the Unicode + Consortium: + + +---------------+---------------------------------------------------+ + | Item | Value | + +---------------+---------------------------------------------------+ + | Name | Unicode Consortium | + | Contact Email | cldr-contact@unicode.org | + | Discussion | cldr-users@unicode.org | + | List Email | | + | URL Location | cldr.unicode.org | + | Specification | Unicode Technical Standard #35 Unicode Locale | + | | Data Markup Language (LDML), | + | | http://unicode.org/reports/tr35/ | + | Section | Section 3 Unicode Language and Locale Identifiers | + +---------------+---------------------------------------------------+ + +2.2. Structure + + The subtags in the 't' extension are of the following form: + + t-ext = "t" ; Extension + (("-" lang *("-" field)) ; Source + optional field(s) + / 1*("-" field)) ; Field(s) only (no source) + + lang = language ; BCP 47, with restrictions + ["-" script] + ["-" region] + *("-" variant) + + field = fsep 1*("-" 3*8alphanum) ; With restrictions + + fsep = ALPHA DIGIT ; Subtag separators + alphanum = ALPHA / DIGIT + + where ,