From 4bfd864f10b68b71482b35c818559068ef8d5797 Mon Sep 17 00:00:00 2001 From: Thomas Voss Date: Wed, 27 Nov 2024 20:54:24 +0100 Subject: doc: Add RFC documents --- doc/rfc/rfc2070.txt | 2411 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 2411 insertions(+) create mode 100644 doc/rfc/rfc2070.txt (limited to 'doc/rfc/rfc2070.txt') diff --git a/doc/rfc/rfc2070.txt b/doc/rfc/rfc2070.txt new file mode 100644 index 0000000..a49b728 --- /dev/null +++ b/doc/rfc/rfc2070.txt @@ -0,0 +1,2411 @@ + + + + + + +Network Working Group F. Yergeau +Request for Comments: 2070 Alis Technologies +Category: Standards Track G. Nicol + Electronic Book Technologies + G. Adams + Spyglass + M. Duerst + University of Zurich + January 1997 + + + Internationalization of the Hypertext Markup Language + +Status of this Memo + + This document specifies an Internet standards track protocol for the + Internet community, and requests discussion and suggestions for + improvements. Please refer to the current edition of the "Internet + Official Protocol Standards" (STD 1) for the standardization state + and status of this protocol. Distribution of this memo is unlimited. + +Abstract + + The Hypertext Markup Language (HTML) is a markup language used to + create hypertext documents that are platform independent. Initially, + the application of HTML on the World Wide Web was seriously + restricted by its reliance on the ISO-8859-1 coded character set, + which is appropriate only for Western European languages. Despite + this restriction, HTML has been widely used with other languages, + using other coded character sets or character encodings, at the + expense of interoperability. + + This document is meant to address the issue of the + internationalization (i18n, i followed by 18 letters followed by n) + of HTML by extending the specification of HTML and giving additional + recommendations for proper internationalization support. 
A foremost + consideration is to make sure that HTML remains a valid application + of SGML, while enabling its use with all languages of the world. + +Table of Contents + + 1. Introduction .................................................. 2 + 1.1. Scope ...................................................... 2 + 1.2. Conformance ................................................ 3 + 2. The document character set ..................................... 4 + 2.1. Reference processing model ................................. 4 + 2.2. The document character set ................................. 6 + 2.3. Undisplayable characters ................................... 8 + + + +Yergeau, et. al. Standards Track [Page 1] + +RFC 2070 HTML Internationalization January 1997 + + + 3. The LANG attribute.............................................. 8 + 4. Additional entities, attributes and elements ................... 9 + 4.1. Full Latin-1 entity set .................................... 9 + 4.2. Markup for language-dependent presentation ................ 10 + 5. Forms ..........................................................16 + 5.1. DTD additions ..............................................16 + 5.2. Form submission ............................................17 + 6. External character encoding issues .............................18 + 7. HTML public text ...............................................20 + 7.1. HTML DTD ...................................................20 + 7.2. SGML declaration for HTML ..................................35 + 7.3. ISO Latin 1 character entity set ...........................37 + 8. Security Considerations.........................................40 + Bibliography ......................................................40 + Authors' Addresses ................................................43 + +1. Introduction + + The Hypertext Markup Language (HTML) is a markup language used to + create hypertext documents that are platform independent. 
Initially, + the application of HTML on the World Wide Web was seriously + restricted by its reliance on the ISO-8859-1 coded character set, + which is appropriate only for Western European languages. Despite + this restriction, HTML has been widely used with other languages, + using other coded character sets or character encodings, through + various ad hoc extensions to the language [TAKADA]. + + This document is meant to address the issue of the + internationalization of HTML by extending the specification of HTML + and giving additional recommendations for proper internationalization + support. It is in good part based on a paper by one of the authors + on multilingualism on the WWW [NICOL]. A foremost consideration is + to make sure that HTML remains a valid application of SGML, while + enabling its use with all languages of the world. + + The specific issues addressed are the SGML document character set to + be used for HTML, the proper treatment of the charset parameter + associated with the "text/html" content type and the specification of + some additional elements and entities. + +1.1 Scope + + HTML has been in use by the World-Wide Web (WWW) global information + initiative since 1990. This specification extends the capabilities + of HTML 2.0 (RFC 1866), primarily by removing the restriction to the + ISO-8859-1 coded character set [ISO-8859]. + + + + + +Yergeau, et. al. Standards Track [Page 2] + +RFC 2070 HTML Internationalization January 1997 + + + HTML is an application of ISO Standard 8879:1986, Information + Processing Text and Office Systems -- Standard Generalized Markup + Language (SGML) [ISO-8879]. The HTML Document Type Definition (DTD) + is a formal definition of the HTML syntax in terms of SGML. This + specification amends the DTD of HTML 2.0 in order to make it + applicable to documents encompassing a character repertoire much + larger than that of ISO-8859-1, while still remaining SGML + conformant. 
+ + Both formal and actual development of HTML are advancing very fast. + The features described in this document are designed so that they can + (and should) be added to other forms of HTML besides that described + in RFC 1866. Where indicated, attributes introduced here should be + extended to the appropriate elements. + +1.2 Conformance + + This specification changes slightly the conformance requirements of + HTML documents and HTML user agents. + +1.2.1 Documents + + All HTML 2.0 conforming documents remain conforming with this + specification. However, the extensions introduced here make valid + certain documents that would not be HTML 2.0 conforming, in + particular those containing characters or character references + outside of the repertoire of ISO 8859-1, and those containing markup + introduced herein. + +1.2.2. User agents + + In addition to the requirements of RFC 1866, the following + requirements are placed on HTML user agents. + + To ensure interoperability and proper support for at least ISO- + 8859-1 in an environment where character encoding schemes other + than ISO-8859-1 are present, user agents MUST correctly interpret + the charset parameter accompanying an HTML document received from + the network. + + Furthermore, conforming user-agents MUST at least parse correctly + all numeric character references within the range of ISO 10646-1 + [ISO-10646]. + + Conforming user-agents are required to apply the BIDI presentation + algorithm if they display right-to-left characters. If there is + no displayable right-to-left character in a document, there is no + need to apply BIDI processing. + + + +Yergeau, et. al. Standards Track [Page 3] + +RFC 2070 HTML Internationalization January 1997 + + +2. The document character set + +2.1. Reference processing model + + This overview explains a reference processing model used for HTML, + and in particular the SGML concept of a document character set. 
An + actual implementation may widely differ in its internal workings from + the model given below, but should behave as described to an outside + observer. + + Because there are various widely differing encodings of text, SGML + does not directly address how the sequence of characters that + constitutes an SGML document in the abstract sense are encoded by + means of a sequence of octets (or occasionally bit groups of another + length than 8) in a concrete realization of the document such as a + computer file. This encoding is called the external character + encoding of the concrete SGML document, and it should be carefully + distinguished from the document character set of the abstract HTML + document. SGML views the characters as a single set (called a + "character repertoire"), and a "code set" that assigns an integer + number (known as "character number") to each character in the + repertoire. The document character set declaration defines what each + of the character numbers represents [GOLD90, p. 451]. In most cases, + an SGML DTD and all documents that refer to it have a single document + character set, and all markup and data characters are part of this + set. + + HTML, as an application of SGML, does not directly address the + question of the external character encoding. This is deferred to + mechanisms external to HTML, such as MIME as used by the HTTP + protocol or by electronic mail. + + For the HTTP protocol [RFC2068], the external character encoding is + indicated by the "charset" parameter of the "Content-Type" field of + the header of an HTTP response. For example, to indicate that the + transmitted document is encoded in the "JUNET" encoding of Japanese + [RFC1468], the header will contain the following line: + + Content-Type: text/html; charset=ISO-2022-JP + + The term "charset" in MIME is used to designate a character encoding, + rather than merely a coded character set as the term may suggest. 
A + character encoding is a mapping (possibly many-to-one) of sequences + of octets to sequences of characters taken from one or more character + repertoires. + + The HTTP protocol also defines a mechanism for the client to specify + the character encodings it can accept. Clients and servers are + strongly requested to use these mechanisms to assure correct + transmission and interpretation of any document. Provisions that can + be taken to help correct interpretation, even in cases where a server + or client does not yet use these mechanisms, are described in section + 6. + + Similarly, if HTML documents are transferred by electronic mail, the + external character encoding is defined by the "charset" parameter of + the "Content-Type" MIME header field [RFC2045], and defaults to + US-ASCII in its absence. + + No mechanisms are currently standardized for indicating the external + character encoding of HTML documents transferred by FTP or accessed + in distributed file systems. + + In case any other ways of transferring and storing HTML documents + are defined or become popular, it is advised that similar provisions + be made to clearly identify the character encoding used and/or to use + a single/default encoding capable of representing the widest range of + characters used in an international context. + + Whatever the external character encoding may be, the reference + processing model translates it to the document character set + specified in Section 2.2 before processing specific to SGML/HTML. + The reference processing model can be depicted as follows: +
+      [resource]->[decoder]->[entity ]->[ SGML ]->[application]->[display]
+                             [manager]  [parser]
+                                 ^          |
+                                 |          |
+                                 +----------+
+ + The decoder is responsible for decoding the external representation + of the resource to the document character set. 
The entity manager, + the parser, and the application deal only with characters of the + document character set. A display-oriented part of the application + or the display machinery itself may again convert characters + represented in the document character set to some other + representation more suitable for their purpose. In any case, the + entity manager, the parser, and the application, as far as character + semantics are concerned, are using the HTML document character set + only. + + An actual implementation may choose, or not, to translate the + document into some encoding of the document character set as + described above; the behaviour described by this reference processing + model can be achieved otherwise. This subject is well out of the + scope of this specification, however, and the reader is invited to + + + +Yergeau, et. al. Standards Track [Page 5] + +RFC 2070 HTML Internationalization January 1997 + + + consult the SGML standard [ISO-8879] or an SGML handbook [BRYAN88] + [GOLD90] [VANH90] [SQ91] for further information. + + The most important consequence of this reference processing model is + that numeric character references are always resolved with respect to + the fixed document character set, and thus to the same characters, + whatever the external encoding actually used. For an example, see + Section 2.2. + +2.2. The document character set + + The document character set, in the SGML sense, is the Universal + Character Set (UCS) of ISO 10646:1993 [ISO-10646], as amended. + Currently, this is code-by-code identical with the Unicode standard, + version 1.1 [UNICODE]. + + NOTE -- implementers should be aware that ISO 10646 is amended + from time to time; 4 amendments have been adopted since the + initial 1993 publication, none of which significantly affects this + specification. 
A fifth amendment, now under consideration, will + introduce incompatible changes to the standard: 6556 Korean Hangul + syllables allocated between code positions 3400 and 4DFF + (hexadecimal) will be moved to new positions (and 4516 new + syllables added), thus making references to the old positions + invalid. Since the Unicode consortium has already adopted a + corresponding amendment for inclusion in the forthcoming Unicode + 2.0, adoption of DAM 5 is considered likely and implementers + should probably consider the old code positions as already + invalid. Despite this one-time change, the relevant standard + bodies have committed themselves not to change any allocated code + position in the future. To encode Korean Hangul irrespective of + these changes, the conjoining Hangul Jamo in the range 1110-11F9 + can be used. + + The adoption of this document character set implies a change in the + SGML declaration specified in the HTML 2.0 specification (section 9.5 + of [RFC1866]). The change amounts to removing the first BASESET + specification and its accompanying DESCSET declaration, replacing + them with the following declaration: +
+     BASESET  "ISO Registration Number 177//CHARSET
+               ISO/IEC 10646-1:1993 UCS-4 with implementation level 3
+               //ESC 2/5 2/15 4/6"
+     DESCSET  0   9          UNUSED
+              9   2          9
+              11  2          UNUSED
+              13  1          13
+              14  18         UNUSED
+              32  95         32
+              127 1          UNUSED
+              128 32         UNUSED
+              160 2147483486 160
+ + Making the UCS the document character set does not create + non-conformance of any expression, construct or document that is + conforming to HTML 2.0. It does make conforming certain constructs + that are not admissible in HTML 2.0. One consequence is that data + characters outside the repertoire of ISO-8859-1, but within that of + UCS-4 become valid SGML characters. 
Another is that the upper limit + of the range of numeric character references is extended from 255 to + 2147483645; thus, &#1048; is a valid reference to a "CYRILLIC CAPITAL + LETTER I". [ERCS] is a good source of information on Unicode and + SGML, although its scope and technical content differ greatly from + this specification. + + NOTE -- the above SGML declaration, like that of HTML 2.0, + specifies the character numbers 128 to 159 (80 to 9F hex) as + UNUSED. This means that numeric character references within that + range (e.g. &#146;) are illegal in HTML. Neither ISO 8859-1 nor + ISO 10646 contains characters in that range, which is reserved for + control characters. + + Another change was made from the HTML 2.0 SGML declaration, in the + belief that the latter did not express its authors' true intent. The + syntax character set declaration was changed from ISO 646.IRV:1983 to + the newer ISO 646.IRV:1991, the latter, but not the former, being + identical with US-ASCII. In principle, this introduces an + incompatibility with HTML 2.0, but in practice it should increase + interoperability by i) having the SGML declaration say what everyone + thinks and ii) making the syntax character set a proper subset of the + document character set. The characters that differ between the two + versions of ISO 646.IRV are not actually used to express HTML syntax. + + ISO 10646-1:1993 is the most encompassing character set currently + existing, and there is no other character set that could take its + place as the document character set for HTML. If nevertheless for a + specific application there is a need to use characters outside this + standard, this should be done by avoiding any conflicts with present + or future versions of ISO 10646, i.e. by assigning these characters + to a private zone of the UCS-4 coding space [ISO-10646 section 11]. 
+ Also, it should be borne in mind that such a use will be highly + unportable; in many cases, it may be better to use inline bitmaps. + +2.3. Undisplayable characters + + With the document character set being the full ISO 10646, the + possibility that a character cannot be displayed due to lack of + appropriate resources (fonts) cannot be avoided. Because there are + many different things that can be done in such a case, this document + does not prescribe any specific behaviour. Depending on the + implementation, this may also be handled by the underlying display + system and not the application itself. The following considerations, + however, may be of help: + + - A clearly visible, but unobtrusive behaviour should be preferred. + Some documents may contain many characters that cannot be + rendered, and so showing an alert for each of them is not the + right thing to do. + + - In case a numeric representation of the missing character is + given, its hexadecimal (not decimal) form is to be preferred, + because this form is used in character set standards [ERCS]. + +3. The LANG attribute + + Language tags can be used to control rendering of a marked up + document in various ways: glyph disambiguation, in cases where the + character encoding is not sufficient to resolve to a specific glyph; + quotation marks; hyphenation; ligatures; spacing; voice synthesis; + etc. Independently of rendering issues, language markup is useful as + content markup for purposes such as classification and searching. + + Since any text can logically be assigned a language, almost all HTML + elements admit the LANG attribute. The DTD reflects this; the only + elements in this version of HTML without the LANG attribute are BR, + HR, BASE, NEXTID, and META. It is also intended that any new element + introduced in later versions of HTML will admit the LANG attribute, + unless there is a good reason not to do so. 
+ + The language attribute, LANG, takes as its value a language tag that + identifies a natural language spoken, written, or otherwise conveyed + by human beings for communication of information to other human + beings. Computer languages are explicitly excluded. + + + + + + +Yergeau, et. al. Standards Track [Page 8] + +RFC 2070 HTML Internationalization January 1997 + + + The syntax and registry of HTML language tags is the same as that + defined by RFC 1766 [RFC1766]. In summary, a language tag is composed + of one or more parts: A primary language tag and a possibly empty + series of subtags: + + language-tag = primary-tag *( "-" subtag ) + primary-tag = 1*8ALPHA + subtag = 1*8ALPHA + + Whitespace is not allowed within the tag and all tags are case- + insensitive. The namespace of language tags is administered by the + IANA. Example tags include: + + en, en-US, en-cockney, i-cherokee, x-pig-latin + + In the context of HTML, a language tag is not to be interpreted as a + single token, as per RFC 1766, but as a hierarchy. For example, a + user agent that adjusts rendering according to language should + consider that it has a match when a language tag in a style sheet + entry matches the initial portion of the language tag of an element. + An exact match should be preferred. This interpretation allows an + element marked up as, for instance, "en-US" to trigger styles + corresponding to, in order of preference, US-English ("en-US") or + 'plain' or 'international' English ("en"). + + NOTE -- using the language tag as a hierarchy does not imply that + all languages with a common prefix will be understood by those + fluent in one or more of those languages; it simply allows the + user to request this commonality when it is true for that user. + + The rendering of elements may be affected by the LANG attribute. 
For + any element, the value of the LANG attribute overrides the value + specified by the LANG attribute of any enclosing element and the + value (if any) of the HTTP Content-Language header. If none of these + are set, a suitable default, perhaps controlled by user preferences, + by automatic context analysis or by the user's locale, should be used + to control rendering. + +4. Additional entities, attributes and elements + +4.1. Full Latin-1 entity set + + According to the suggestion of section 14 of [RFC1866], the set of + Latin-1 entities is extended to cover the whole right part of + ISO-8859-1 (all code positions with the high-order bit set), including + the already commonly used &nbsp;, &copy; and &reg;. The names of the + entities are taken from the appendices of SGML [ISO-8879]. A list is + provided in section 7.3 of this specification. + +4.2. Markup for language-dependent presentation + +4.2.1. Overview + + For the correct presentation of text in certain languages + (irrespective of formatting issues), some support in the form of + additional entities and elements is needed. + + In particular, the following features are dealt with: + + - Markup of bidirectional text, i.e. text where left-to-right and + right-to-left scripts are mixed. + + - Control of cursive joining behaviour in contexts where the + default behaviour is not appropriate. + + - Language-dependent rendering of short (in-line) quotations. + + - Better justification control for languages where this is + important. + + - Superscripts and subscripts for languages where they appear as + part of general text. + + Some of the above features need very little additional support; + others need more. The additional features are introduced below with + brief comments only. Explanations on cursive joining behaviour and + bidirectional text follow later. 
For cursive joining behaviour and + bidirectional text, this document follows [UNICODE] in that: i) + character semantics, where applicable, are identical to [UNICODE], + and ii) where functionality is moved to HTML as a higher level + protocol, this is done in a way that allows straightforward + conversion to the lower-level mechanisms defined in [UNICODE]. + +4.2.2. List of entities, elements, and attributes + + First, a generic container is needed to carry the LANG and DIR (see + below) attributes in cases where no other element is appropriate; the + SPAN element is introduced for that purpose. + + A set of named character entities is added for use with bidirectional + rendering and cursive joining control: +
+   <!ENTITY zwnj CDATA "&#8204;"--=zero width non-joiner-->
+   <!ENTITY zwj  CDATA "&#8205;"--=zero width joiner-->
+   <!ENTITY lrm  CDATA "&#8206;"--=left-to-right mark-->
+   <!ENTITY rlm  CDATA "&#8207;"--=right-to-left mark-->
+ + These entities can be used in place of the corresponding formatting + characters whenever convenient, for example to ease keyboard entry or + when a formatting character is not available in the character + encoding of the document. + + Next, an attribute called DIR is introduced, restricted to the values + LTR (left-to-right) and RTL (right-to-left), for the indication of + directionality in the context of bidirectional text (see 4.2.4 below + for details). Since any text and many other elements (e.g. tables) + can logically be assigned a directionality, all elements except BR, + HR, BASE, NEXTID, and META admit this attribute. The DTD reflects + this. It is also intended that any new element introduced in later + versions of HTML will admit the DIR attribute, unless there is a good + reason not to do so. + + A new phrase-level element called BDO (BIDI Override) is introduced, + which requires the DIR attribute to specify whether the override is + left-to-right or right-to-left. This element is required for + bidirectional text control; for detailed explanations, see section + 4.2.4. 
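As a rough illustration of how a processor might treat the named formatting entities, the Python sketch below expands them to the ISO 10646 characters they stand for. It assumes the four entity names zwnj, zwj, lrm and rlm at decimal code positions 8204 to 8207; the function name expand_formatting_entities and the table name FORMATTING_ENTITIES are hypothetical and not part of this specification.

```python
import re

# Assumed mapping of entity names to ISO 10646 formatting characters.
FORMATTING_ENTITIES = {
    "zwnj": "\u200C",  # ZERO WIDTH NON-JOINER (decimal 8204)
    "zwj": "\u200D",   # ZERO WIDTH JOINER (decimal 8205)
    "lrm": "\u200E",   # LEFT-TO-RIGHT MARK (decimal 8206)
    "rlm": "\u200F",   # RIGHT-TO-LEFT MARK (decimal 8207)
}

# Matches only the four entity references handled by this sketch.
_ENTITY_RE = re.compile(r"&(zwnj|zwj|lrm|rlm);")


def expand_formatting_entities(text):
    """Replace the four named entities with their formatting characters.

    Any other entity reference is left untouched.
    """
    return _ENTITY_RE.sub(lambda m: FORMATTING_ENTITIES[m.group(1)], text)
```

A real user agent would resolve these references in its SGML parser rather than with a regular expression; the sketch only shows that each name denotes a single, fixed character of the document character set.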
+ + The phrase-level element Q is introduced to allow language-dependent + rendering of short quotations depending on language and platform + capability. As the following examples show (rather poorly, because of + the character set restriction of Internet specifications), the + quotation marks surrounding the quotation are particularly affected: + "a quotation in English", `another, slightly better one', ,,a + quotation in German'', << a quotation in French >>. The contents of + the Q element does not include quotation marks, which have to be + added by the rendering process. + + NOTE -- Q elements can be nested. Many languages use different + quotation styles for outer and inner quotations, and this should + be respected by user-agents implementing this element. + + + + + + + + + +Yergeau, et. al. Standards Track [Page 11] + +RFC 2070 HTML Internationalization January 1997 + + + NOTE -- minimal support for the Q element is to surround the + contents with some kind of quotes, like the plain ASCII double + quotes. As this is rather easy to implement, and as the lack of + any visible quotes may affect the perceived meaning of the text, + user-agent implementors are strongly requested to provide at least + this minimal level of support. + + Many languages require superscript text for proper rendering: as an + example, the French "Mlle Dupont" should have "lle" in superscript. + The SUP element, and its sibling SUB for subscript text, are + introduced to allow proper markup of such text. SUP and SUB contents + are restricted to PCDATA to avoid nesting problems. + + Finally, in many languages text justification is much more important + than it is in Western languages, and justifies markup. The ALIGN + attribute, admitting values of LEFT, RIGHT, CENTER and JUSTIFY, is + added to a selection of elements where it makes sense (the block-like + P, HR, H1 to H6, OL, UL, DIR, MENU, LI, BLOCKQUOTE and ADDRESS). 
If + a user-agent chooses to have LEFT as a default for blocks of + left-to-right directionality, it should use RIGHT for blocks of + right-to-left directionality. + + NOTE -- RFC 1866 section 4.2.2 specifies that an HTML user agent + should treat an end of line as a word space, except in + preformatted text. This should be interpreted in the context of + the script being processed, as the way words are separated in + writing is script-dependent. For some scripts (e.g. Latin), a + word space is just a space, but in other scripts (e.g. Thai) it is + a zero-width word separator, whereas in yet other scripts (e.g. + Japanese) it is nothing at all, i.e. totally ignored. + + NOTE -- the SOFT HYPHEN character (U+00AD) needs special attention + from user-agent implementers. It is present in many character + sets (including the whole ISO 8859 series and, of course, ISO + 10646), and can always be included by means of the reference + &shy;. Its semantics are different from the plain HYPHEN: it + indicates a point in a word where a line break is allowed. If the + line is indeed broken there, a hyphen must be displayed at the end + of the first line. If not, the character is not displayed at all. + In operations like searching and sorting, it must always be + ignored. + + In the DTD, the LANG and DIR attributes are grouped together in a + parameter entity called attrs. To parallel RFC 1942 [RFC1942], the + ID and CLASS attributes are also included in attrs. The ID and CLASS + attributes are required for use with style sheets, and RFC 1942 + defines them as follows: + +ID Used to define a document-wide identifier. This can be used + for naming positions within documents as the destination of a + hypertext link. It may also be used by style sheets for + rendering an element in a unique style. An ID attribute value is + an SGML NAME token. 
NAME tokens are formed by an initial + letter followed by letters, digits, "-" and "." characters. The + letters are restricted to A-Z and a-z. + +CLASS A space separated list of SGML NAME tokens. CLASS names + specify that the element belongs to the corresponding named + classes. It allows authors to distinguish different roles + played by the same tag. The classes may be used by style + sheets to provide different renderings as appropriate to + these roles. + +4.2.3. Cursive joining behaviour + + Markup is needed in some cases to force cursive joining behaviour in + contexts in which it would not normally occur, or to block it when it + would normally occur. + + The zero-width joiner and non-joiner (&zwj; and &zwnj;) are used to + control cursive joining behaviour. For example, ARABIC LETTER HEH is + used in isolation to abbreviate "Hijri" (the Islamic calendrical + system); however, the initial form of the letter is desired, because + the isolated form of HEH looks like the digit five as employed in + Arabic script. This is obtained by following the HEH with a + zero-width joiner whose only effect is to provide context. In Persian + texts, there are cases where a letter that normally would join a + subsequent letter in a cursive connection does not. Here a + zero-width non-joiner is used. + +4.2.4. Bidirectional text + + Many languages are written in horizontal lines from left to right, + while others are written from right to left. When both writing + directions are present, one talks of bidirectional text (BIDI for + short). BIDI text requires markup in special circumstances where + ambiguities as to the directionality of some characters have to be + resolved. This markup affects the ability to render BIDI text in a + semantically legible fashion. That is, without this special BIDI + markup, cases arise which would prevent *any* rendering whatsoever + that reflected the basic meaning of the text. Plain text may contain + BIDI markup in the form of special-purpose formatting characters. + + This is also possible in HTML, which includes the five BIDI-related + formatting characters (202A - 202E) of ISO 10646. As an alternative, + HTML provides equivalent SGML markup. + + BIDI is a complex issue, and conversion of logical text sequences to + display sequences has to be done according to the algorithm and + character properties specified in [UNICODE]. Here, explanations are + given only as far as they are needed to understand the necessity of + the features introduced and to define their exact semantics. + + The Unicode BIDI algorithm is based on the individual characters of a + text being stored in logical order, that is the order in which they + are normally input and in which the corresponding sounds are normally + spoken. To make rendering of logical order text possible, the + algorithm assigns a directionality property to each character, e.g. + Latin letters are specified to have a left-to-right direction, Arabic + and Hebrew characters have a right-to-left direction. + + The left-to-right and right-to-left marks (&lrm; and &rlm;) are used + to disambiguate directionality of neutral characters. For example, + when a double quote sits between an Arabic and a Latin letter, its + direction is ambiguous; if a directional mark is added on one side + such that the quotation mark is surrounded by characters of only one + directionality, the ambiguity is removed. These characters are like + zero width spaces which have a directional property (but no word/line + break property). + + Nested embeddings of contra-directional text runs, due to nested + quotations or to the pasting of text from one BIDI context to + another, are also a case where the implicit directionality of + characters is not sufficient, requiring markup. 
+   Also, it is frequently desirable to specify the basic directionality
+   of a block of text. For these purposes, the DIR attribute is used.
+
+   On block-type elements, the DIR attribute indicates the base
+   directionality of the text in the block; if omitted it is inherited
+   from the parent element. The default directionality of the overall
+   HTML document is left-to-right.
+
+   On inline elements, it makes the element start a new embedding level
+   (to be explained below); if omitted the inline element does not start
+   a new embedding level.
+
+      NOTE -- the PRE, XMP and LISTING elements admit the DIR attribute.
+      Their contents should not be considered as preformatted with
+      respect to bidirectional layout, but the BIDI algorithm should be
+      applied to each line of text.
+
+   Following is an example of a case where embedding is needed, showing
+   its effect:
+
+      Given the following Latin (upper case) and Arabic (lower case)
+      letters in backing store with the specified embeddings:
+
+      <SPAN DIR=LTR> AB <SPAN DIR=RTL> xy <SPAN DIR=LTR> CD </SPAN> zw
+      </SPAN> EF </SPAN>
+
+      One gets the following rendering (with [] showing the directional
+      transitions):
+
+         [ AB [ wz [ CD ] yx ] EF ]
+
+      On the other hand, without this markup and with a base direction
+      of LTR one gets the following rendering:
+
+         [ AB [ yx ] CD [ wz ] EF ]
+
+      Notice that yx is on the left and wz on the right unlike the above
+      case where the embedding levels are used. Without the embedding
+      markup one has at most two levels: a base directional level and a
+      single counterflow directional level.
+
+   The DIR attribute on inline elements is equivalent to the formatting
+   characters LEFT-TO-RIGHT EMBEDDING (202A) and RIGHT-TO-LEFT
+   EMBEDDING (202B) of ISO 10646. The end tag of the element is
+   equivalent to the POP DIRECTIONAL FORMATTING (202C) character.
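The equivalence just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `embed` is hypothetical, and the upper-case and lower-case Latin letters are placeholders for the Latin and Arabic letters of the example, so a real BIDI renderer would not reorder them. The sketch builds the character sequence that three nested inline elements with DIR would correspond to, matching the three bracket levels of `[ AB [ wz [ CD ] yx ] EF ]`.

```python
import unicodedata

# Directional formatting characters named in the text (ISO 10646 / Unicode).
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def embed(text: str, direction: str) -> str:
    """Wrap text as an inline element with DIR=direction would.

    Hypothetical helper: the start tag maps to LRE or RLE and the
    end tag maps to PDF.
    """
    openers = {"LTR": LRE, "RTL": RLE}
    return openers[direction.upper()] + text + PDF


# Three nested embedding levels, innermost first.
inner = embed("CD", "LTR")
middle = embed("xy " + inner + " zw", "RTL")
outer = embed("AB " + middle + " EF", "LTR")

# Show the code point and official name of each formatting character used.
for ch in (LRE, RLE, PDF):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Each opening character is balanced by exactly one POP DIRECTIONAL FORMATTING, which mirrors the requirement that every start tag has a matching end tag.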
+
+   Directional override, as provided by the BDO element, is needed to
+   deal with unusual short pieces of text in which directionality cannot
+   be resolved from context in an unambiguous fashion. For example, it
+   can be used to force left-to-right (or right-to-left) display of part
+   numbers composed of Latin letters, digits and Hebrew letters.
+
+   The effect of BDO is to force the directionality of all characters
+   within it to the value of DIR, irrespective of their intrinsic
+   directional properties. It is equivalent to using the LEFT-TO-RIGHT
+   OVERRIDE (202D) or RIGHT-TO-LEFT OVERRIDE (202E) characters of ISO
+   10646, the end tag again being equivalent to the POP DIRECTIONAL
+   FORMATTING (202C) character.
+
+      NOTE -- authors and authoring software writers should be aware
+      that conflicts can arise if the DIR attribute is used on inline
+      elements (including BDO) concurrently with the use of the
+      corresponding ISO 10646 formatting characters.
+
+      Preferably one or the other should be used exclusively; the markup
+      method is better able to guarantee document structural integrity,
+      and alleviates some problems when editing bidirectional HTML text
+      with a simple text editor, but some software may be better suited
+      to using the 10646 characters. If both methods are used, great
+      care should be exercised to ensure proper nesting of markup and
+      directional embedding or override; otherwise, rendering results
+      are undefined.
+
+5. Forms
+
+5.1. DTD additions
+
+   It is natural to expect input in any language in forms, as they
+   provide one of the few ways of obtaining user input. While this is
+   primarily a UI issue, there are some things that should be specified
+   at the HTML level to guide behavior and promote interoperability.
+
+   To ensure full interoperability, it is necessary for the user agent
+   (and the user) to have an indication of the character encoding(s)
+   that the server providing a form will be able to handle upon
+   submission of the filled-in form. Such an indication is provided by
+   the ACCEPT-CHARSET attribute of the INPUT and TEXTAREA elements,
+   modeled on the HTTP Accept-Charset header (see [HTTP-1.1]), which
+   contains a space and/or comma delimited list of character sets
+   acceptable to the server. A user agent may want to somehow advise
+   the user of the contents of this attribute, or to restrict the
+   user's ability to enter characters outside the repertoires of the
+   listed character sets.
+
+      NOTE -- The list of character sets is to be interpreted as an
+      EXCLUSIVE-OR list; the server announces that it is ready to accept
+      any ONE of these character encoding schemes for each part of a
+      multipart entity. The client may perform character encoding
+      translation to satisfy the server if necessary.
+
+      NOTE -- The default value for the ACCEPT-CHARSET attribute of an
+      INPUT or TEXTAREA element is the reserved value "UNKNOWN". A user
+      agent may interpret that value as the character encoding scheme
+      that was used to transmit the document containing that element.
+
+5.2. Form submission
+
+   The HTML 2.0 form submission mechanism, based on the
+   "application/x-www-form-urlencoded" media type, is ill-equipped with
+   regard to internationalization. In fact, since URLs are restricted
+   to ASCII characters, the mechanism is awkward even for ISO-8859-1
+   text. Section 2.2 of [RFC1738] specifies that octets may be encoded
+   using the "%HH" notation, but text submitted from a form is composed
+   of characters, not octets. Lacking a specification of a character
+   encoding scheme, the "%HH" notation has no well-defined meaning.
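The ambiguity can be made concrete with a short Python sketch. It assumes the standard `urllib.parse` module; the helper name `percent_encode` and the sample character are illustrative choices, not part of this specification. The same character yields different "%HH" sequences depending on the character encoding scheme applied first, and a receiver that assumes the wrong scheme recovers different text. (The sketch shows only the "%HH" step, not the other details of form encoding such as the treatment of spaces.)

```python
from urllib.parse import quote, unquote


def percent_encode(text: str, charset: str) -> str:
    """Encode characters to octets with `charset`, then apply %HH escaping."""
    return quote(text, safe="", encoding=charset)


sample = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, present in ISO-8859-1

as_latin1 = percent_encode(sample, "iso-8859-1")
as_utf8 = percent_encode(sample, "utf-8")
print(as_latin1)  # %E9
print(as_utf8)    # %C3%A9

# A receiver that assumes the wrong charset recovers different characters.
misread = unquote(as_utf8, encoding="iso-8859-1")
print(misread == sample)  # False
```

Without a label stating which scheme produced the octets, the receiver has no way to choose between the two readings.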
+
+   The best solution is to use the "multipart/form-data" media type
+   described in [RFC1867] with the POST method of form submission. This
+   mechanism encapsulates the value part of each name-value pair in a
+   body-part of a multipart MIME body that is sent as the HTTP entity;
+   each body part can be labeled with an appropriate Content-Type,
+   including if necessary a charset parameter that specifies the
+   character encoding scheme. The changes to the DTD necessary to
+   support this method of form submission have been incorporated in the
+   DTD included in this specification.
+
+   A less satisfactory solution is to add a MIME charset parameter to
+   the "application/x-www-form-urlencoded" media type specifier sent
+   along with a POST method form submission, with the understanding that
+   the URL encoding of [RFC1738] is applied on top of the specified
+   character encoding, as a kind of implicit Content-Transfer-Encoding.
+
+   One problem with both solutions above is that current browsers do not
+   generally allow for bookmarks to specify the POST method; this should
+   be improved. Conversely, the GET method could be used with the form
+   data transmitted in the body instead of in the URL. Nothing in the
+   protocol seems to prevent it, but no implementations appear to exist
+   at present.
+
+   How the user agent determines the encoding of the text entered by the
+   user is outside the scope of this specification.
+
+      NOTE -- Designers of forms and their handling scripts should be
+      aware of an important caveat: when the default value of a field
+      (the VALUE attribute) is returned upon form submission (i.e. the
+      user did not modify this value), it cannot be guaranteed to be
+      transmitted as a sequence of octets identical to that in the
+      source document -- only as a possibly different but valid encoding
+      of the same sequence of text elements. This may be true even if
+      the encoding of the document containing the form and that used for
+      submission are the same.
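The caveat can be illustrated with a small Python sketch. It relies on Unicode normalization (Normalization Form C from the standard `unicodedata` module), which is a later tool assumed here for illustration and is not something this specification prescribes; the helper name `same_text` is hypothetical. The code points are the ones named in this note: three spellings of one text element that differ as octets but compare equal once normalized.

```python
import unicodedata

# Three spellings of the same text element.
partly_composed = "\u00EA\u0323"         # e-circumflex + combining dot below
precomposed = "\u1EC7"                   # e with circumflex and dot below
fully_decomposed = "\u0065\u0302\u0323"  # e + combining circumflex + dot below


def same_text(a: str, b: str) -> bool:
    """Compare two strings as text rather than as code point sequences."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The serialized octets differ between spellings...
print(partly_composed.encode("utf-16-be") == precomposed.encode("utf-16-be"))

# ...but all three denote the same text after normalization.
print(same_text(partly_composed, precomposed))
print(same_text(fully_decomposed, precomposed))
```

A form handler that needs to know whether a default value came back unmodified would therefore be safer comparing normalized text than comparing raw octets.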
+
+      Differences can occur when a sequence of characters can be
+      represented by various sequences of octets, and also when a
+      composite sequence (a base character plus one or more combining
+      diacritics) can be represented by either a different but
+      equivalent composite sequence or by a fully precomposed character.
+      For instance, the UCS-2 sequence 00EA+0323 (LATIN SMALL LETTER E
+      WITH CIRCUMFLEX ACCENT + COMBINING DOT BELOW) may be transformed
+      into 1EC7 (LATIN SMALL LETTER E WITH CIRCUMFLEX ACCENT AND DOT
+      BELOW), into 0065+0302+0323 (LATIN SMALL LETTER E + COMBINING
+      CIRCUMFLEX ACCENT + COMBINING DOT BELOW), as well as into other
+      equivalent composite sequences.
+
+6. External character encoding issues
+
+   Proper interpretation of a text document requires that the character
+   encoding scheme be known. Current HTTP servers, however, do not
+   generally include an appropriate charset parameter with the
+   Content-Type header. This is bad behaviour, which is even encouraged
+   by the continued existence of browsers that declare an unrecognized
+   media type when they receive a charset parameter. User agent
+   implementers are strongly encouraged to make their software
+   tolerant of this parameter, even if they cannot take advantage of it.
+   Proper labelling is highly desirable, but some preventive measures
+   can be taken to minimize the detrimental effects of its absence:
+
+   In the case where a document is accessed from a hyperlink in an
+   origin HTML document, a CHARSET attribute is added to the attribute
+   list of elements with link semantics (A and LINK), specifically by
+   adding it to the linkExtraAttributes entity.
+   The value of that attribute is to be considered a hint to the User
+   Agent as to the character encoding scheme used by the resource
+   pointed to by the hyperlink; it should be the appropriate value of
+   the MIME charset parameter for that resource.
+
+   In any document, it is possible to include an indication of the
+   encoding scheme like the following, as early as possible within the
+   HEAD of the document:
+
+   <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-2022-JP">
+
+   This is not foolproof, but will work if the encoding scheme is such
+   that ASCII-valued octets stand for ASCII characters only, at least
+   until the META element is parsed. Note that there are better ways
+   for a server to obtain character encoding information, instead of the
+   unreliable META above; see [NICOL2] for some details and a proposal.
+
+   For definiteness, the "charset" parameter received from the source of
+   the document should be considered the most authoritative, followed in
+   order of preference by the contents of a META element such as the
+   above, and finally the CHARSET attribute of the anchor that was
+   followed (if any).
+
+   When HTML text is transmitted directly in UCS-2 or UCS-4 form, the
+   question of byte order arises: does the high-order byte of each
+   multi-byte character come first or last? For definiteness, this
+   specification recommends that UCS-2 and UCS-4 be transmitted in
+   big-endian byte order (high order byte first), which corresponds to
+   the established network byte order for two- and four-byte
+   quantities, to the ISO 10646 requirement and Unicode recommendation
+   for serialized text data and to RFC 1641.
+   Furthermore, to maximize chances of proper interpretation, it is
+   recommended that documents transmitted as UCS-2 or UCS-4 always
+   begin with a ZERO-WIDTH NON-BREAKING SPACE character (hexadecimal
+   FEFF or 0000FEFF) which, when byte-reversed, becomes FFFE or
+   FFFE0000, a value guaranteed never to be assigned to a character.
+   Thus, a user agent receiving FFFE as the first octets of a text
+   would know that bytes have to be reversed for the remainder of the
+   text.
+
+   There exist so-called UCS Transformation Formats that can be used to
+   transmit UCS data, in addition to UCS-2 and UCS-4. UTF-7 [RFC1642]
+   and UTF-8 [UTF-8] have favorable properties (no byte-ordering
+   problem, different flavours of ASCII compatibility) that make them
+   worthy of consideration, especially for transmission of multilingual
+   text. Another encoding scheme, MNEM [RFC1345], also has interesting
+   properties and the capability to transmit the full UCS. The UTF-1
+   transformation format of ISO 10646:1993 (registered by IANA as
+   ISO-10646-UTF-1) has been removed from ISO 10646 by amendment 4, and
+   should not be used.
+
+7. HTML Public Text
+
+7.1. HTML DTD
+
+   This section contains a DTD for HTML based on the HTML 2.0 DTD of RFC
+   1866, incorporating the changes for file upload as specified in RFC
+   1867, and the changes deriving from this document.
+
+   [DTD text not preserved in this copy.]
+
+7.2. SGML Declaration for HTML
+
+   [SGML declaration text not preserved in this copy.]
+
+7.3. ISO Latin 1 entity set
+
+   The following public text lists each of the characters specified in
+   the Added Latin 1 entity set, along with its name, syntax for use,
+   and description. This list is derived from ISO Standard
+   8879:1986//ENTITIES Added Latin 1//EN. HTML includes the entire
+   entity set, and adds entities for all missing characters in the right
+   part of ISO-8859-1.
+
+   [Entity declarations not preserved in this copy.]
+
+8. Security Considerations
+
+   Anchors, embedded images, and all other elements which contain URIs
+   as parameters may cause the URI to be dereferenced in response to
+   user input. In this case, the security considerations of [RFC1738]
+   apply.
+
+   The widely deployed methods for submitting form requests -- HTTP and
+   SMTP -- provide little assurance of confidentiality.
+   Information providers who request sensitive information via forms --
+   especially by way of the `PASSWORD' type input field (see section
+   8.1.2 in [RFC1866]) -- should be aware and make their users aware of
+   the lack of confidentiality.
+
+Bibliography
+
+   [BRYAN88]   M. Bryan, "SGML -- An Author's Guide to the Standard
+               Generalized Markup Language", Addison-Wesley, Reading,
+               1988.
+
+   [ERCS]      Extended Reference Concrete Syntax for SGML.
+
+   [GOLD90]    C. F. Goldfarb, "The SGML Handbook", Y. Rubinsky, Ed.,
+               Oxford University Press, 1990.
+
+   [HTTP-1.1]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
+               and T. Berners-Lee, "Hypertext Transfer Protocol --
+               HTTP/1.1", RFC 2068, January 1997.
+
+   [ISO-639]   ISO 639:1988. International standard -- Code for the
+               representation of the names of languages. Technical
+               content in
+
+   [ISO-8859]  ISO 8859. International standard -- Information
+               processing -- 8-bit single-byte coded graphic character
+               sets -- Part 1: Latin alphabet No. 1 (1987) -- Part 2:
+               Latin alphabet No. 2 (1987) -- Part 3: Latin alphabet
+               No. 3 (1988) -- Part 4: Latin alphabet No. 4 (1988) --
+               Part 5: Latin/Cyrillic alphabet (1988) -- Part 6:
+               Latin/Arabic alphabet (1987) -- Part 7: Latin/Greek
+               alphabet (1987) -- Part 8: Latin/Hebrew alphabet
+               (1988) -- Part 9: Latin alphabet No. 5 (1989) -- Part
+               10: Latin alphabet No. 6 (1992)
+
+   [ISO-8879]  ISO 8879:1986. International standard -- Information
+               processing -- Text and office systems -- Standard
+               generalized markup language (SGML).
+
+   [ISO-10646] ISO/IEC 10646-1:1993. International standard --
+               Information technology -- Universal multiple-octet coded
+               character set (UCS) -- Part 1: Architecture and basic
+               multilingual plane.
+
+   [NICOL]     G.T. Nicol, "The Multilingual World Wide Web",
+               Electronic Book Technologies, 1995,
+
+   [NICOL2]    G.T. Nicol, "MIME Header Supplemented File Type", Work
+               in Progress, EBT, October 1995.
+
+   [RFC1345]   Simonsen, K., "Character Mnemonics & Character Sets",
+               RFC 1345, Rationel Almen Planlaegning, June 1992.
+
+   [RFC1468]   Murai, J., Crispin, M., and E. van der Poel,
+               "Japanese Character Encoding for Internet Messages",
+               RFC 1468, Keio University, Panda Programming, June
+               1993.
+
+   [RFC2045]   Freed, N., and N. Borenstein, "Multipurpose Internet
+               Mail Extensions (MIME) Part One: Format of Internet
+               Message Bodies", RFC 2045, Innosoft, First Virtual,
+               November 1996.
+
+   [RFC1641]   Goldsmith, D., and M. Davis, "Using Unicode with MIME",
+               RFC 1641, Taligent, Inc., July 1994.
+
+   [RFC1642]   Goldsmith, D., and M. Davis, "UTF-7: A Mail-safe
+               Transformation Format of Unicode", RFC 1642, Taligent,
+               Inc., July 1994.
+
+   [RFC1738]   Berners-Lee, T., Masinter, L., and M. McCahill,
+               "Uniform Resource Locators (URL)", RFC 1738, CERN,
+               Xerox PARC, University of Minnesota, October 1994.
+
+   [RFC1766]   Alvestrand, H., "Tags for the Identification of
+               Languages", RFC 1766, UNINETT, March 1995.
+
+   [RFC1866]   Berners-Lee, T., and D. Connolly, "Hypertext Markup
+               Language - 2.0", RFC 1866, MIT/W3C, November 1995.
+
+   [RFC1867]   Nebel, E., and L. Masinter, "Form-based File Upload
+               in HTML", RFC 1867, Xerox Corporation, November 1995.
+
+   [RFC1942]   Raggett, D., "HTML Tables", RFC 1942, W3C, May 1996.
+
+   [RFC2068]   Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
+               and T. Berners-Lee, "Hypertext Transfer Protocol --
+               HTTP/1.1", RFC 2068, January 1997.
+
+   [SQ91]      SoftQuad, "The SGML Primer", 3rd ed., SoftQuad Inc.,
+               1991.
+
+   [TAKADA]    Toshihiro Takada, "Multilingual Information Exchange
+               through the World-Wide Web", Computer Networks and
+               ISDN Systems, Vol. 27, No. 2, Nov. 1994, p. 235-241.
+
+   [TEI]       TEI Guidelines for Electronic Text Encoding and
+               Interchange.
+
+   [UNICODE]   The Unicode Consortium, "The Unicode Standard --
+               Worldwide Character Encoding -- Version 1.0",
+               Addison-Wesley, Volume 1, 1991, Volume 2, 1992, and
+               Technical Report #4, 1993. The BIDI algorithm is in
+               appendix A of volume 1, with corrections in appendix D
+               of volume 2.
+
+   [UTF-8]     ISO/IEC 10646-1:1993 AMENDMENT 2 (1996). UCS
+               Transformation Format 8 (UTF-8).
+
+   [VANH90]    E. van Herwijnen, "Practical SGML", Kluwer Academic
+               Publishers Group, Norwell and Dordrecht, 1990.
+
+Authors' Addresses
+
+   François Yergeau
+   Alis Technologies
+   100, boul. Alexis-Nihon, bureau 600
+   Montréal QC H4M 2P2
+   Canada
+
+   Tel: +1 (514) 747-2547
+   Fax: +1 (514) 747-2561
+   EMail: fyergeau@alis.com
+
+
+   Gavin Thomas Nicol
+   Electronic Book Technologies, Japan
+   1-29-9 Tsurumaki,
+   Setagaya-ku,
+   Tokyo
+   Japan
+
+   Tel: +81-3-3230-8161
+   Fax: +81-3-3230-8163
+   EMail: gtn@ebt.com, gtn@twics.co.jp
+
+
+   Glenn Adams
+   Spyglass
+   118 Magazine Street
+   Cambridge, MA 02139
+   U.S.A.
+
+   Tel: +1 (617) 864-5524
+   Fax: +1 (617) 864-4965
+   EMail: glenn@spyglass.com
+
+
+   Martin J. Duerst
+   Multimedia-Laboratory
+   Department of Computer Science
+   University of Zurich
+   Winterthurerstrasse 190
+   CH-8057 Zurich
+   Switzerland
+
+   Tel: +41 1 257 43 16
+   Fax: +41 1 363 00 35
+   EMail: mduerst@ifi.unizh.ch