summaryrefslogtreecommitdiff
path: root/doc/rfc/rfc2070.txt
diff options
context:
space:
mode:
Diffstat (limited to 'doc/rfc/rfc2070.txt')
-rw-r--r--doc/rfc/rfc2070.txt2411
1 files changed, 2411 insertions, 0 deletions
diff --git a/doc/rfc/rfc2070.txt b/doc/rfc/rfc2070.txt
new file mode 100644
index 0000000..a49b728
--- /dev/null
+++ b/doc/rfc/rfc2070.txt
@@ -0,0 +1,2411 @@
+
+
+
+
+
+
+Network Working Group F. Yergeau
+Request for Comments: 2070 Alis Technologies
+Category: Standards Track G. Nicol
+ Electronic Book Technologies
+ G. Adams
+ Spyglass
+ M. Duerst
+ University of Zurich
+ January 1997
+
+
+ Internationalization of the Hypertext Markup Language
+
+Status of this Memo
+
+ This document specifies an Internet standards track protocol for the
+ Internet community, and requests discussion and suggestions for
+ improvements. Please refer to the current edition of the "Internet
+ Official Protocol Standards" (STD 1) for the standardization state
+ and status of this protocol. Distribution of this memo is unlimited.
+
+Abstract
+
+ The Hypertext Markup Language (HTML) is a markup language used to
+ create hypertext documents that are platform independent. Initially,
+ the application of HTML on the World Wide Web was seriously
+ restricted by its reliance on the ISO-8859-1 coded character set,
+ which is appropriate only for Western European languages. Despite
+ this restriction, HTML has been widely used with other languages,
+ using other coded character sets or character encodings, at the
+ expense of interoperability.
+
+ This document is meant to address the issue of the
+ internationalization (i18n, i followed by 18 letters followed by n)
+ of HTML by extending the specification of HTML and giving additional
+ recommendations for proper internationalization support. A foremost
+ consideration is to make sure that HTML remains a valid application
+ of SGML, while enabling its use with all languages of the world.
+
+Table of Contents
+
+ 1. Introduction .................................................. 2
+ 1.1. Scope ...................................................... 2
+ 1.2. Conformance ................................................ 3
+ 2. The document character set ..................................... 4
+ 2.1. Reference processing model ................................. 4
+ 2.2. The document character set ................................. 6
+ 2.3. Undisplayable characters ................................... 8
+
+
+
+Yergeau, et. al. Standards Track [Page 1]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ 3. The LANG attribute.............................................. 8
+ 4. Additional entities, attributes and elements ................... 9
+ 4.1. Full Latin-1 entity set .................................... 9
+ 4.2. Markup for language-dependent presentation ................ 10
+ 5. Forms ..........................................................16
+ 5.1. DTD additions ..............................................16
+ 5.2. Form submission ............................................17
+ 6. External character encoding issues .............................18
+ 7. HTML public text ...............................................20
+ 7.1. HTML DTD ...................................................20
+ 7.2. SGML declaration for HTML ..................................35
+ 7.3. ISO Latin 1 character entity set ...........................37
+ 8. Security Considerations.........................................40
+ Bibliography ......................................................40
+ Authors' Addresses ................................................43
+
+1. Introduction
+
+ The Hypertext Markup Language (HTML) is a markup language used to
+ create hypertext documents that are platform independent. Initially,
+ the application of HTML on the World Wide Web was seriously
+ restricted by its reliance on the ISO-8859-1 coded character set,
+ which is appropriate only for Western European languages. Despite
+ this restriction, HTML has been widely used with other languages,
+ using other coded character sets or character encodings, through
+ various ad hoc extensions to the language [TAKADA].
+
+ This document is meant to address the issue of the
+ internationalization of HTML by extending the specification of HTML
+ and giving additional recommendations for proper internationalization
+ support. It is in good part based on a paper by one of the authors
+ on multilingualism on the WWW [NICOL]. A foremost consideration is
+ to make sure that HTML remains a valid application of SGML, while
+ enabling its use with all languages of the world.
+
+ The specific issues addressed are the SGML document character set to
+ be used for HTML, the proper treatment of the charset parameter
+ associated with the "text/html" content type and the specification of
+ some additional elements and entities.
+
+1.1 Scope
+
+ HTML has been in use by the World-Wide Web (WWW) global information
+ initiative since 1990. This specification extends the capabilities
+ of HTML 2.0 (RFC 1866), primarily by removing the restriction to the
+ ISO-8859-1 coded character set [ISO-8859].
+
+
+
+
+
+Yergeau, et. al. Standards Track [Page 2]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ HTML is an application of ISO Standard 8879:1986, Information
+ Processing Text and Office Systems -- Standard Generalized Markup
+ Language (SGML) [ISO-8879]. The HTML Document Type Definition (DTD)
+ is a formal definition of the HTML syntax in terms of SGML. This
+ specification amends the DTD of HTML 2.0 in order to make it
+ applicable to documents encompassing a character repertoire much
+ larger than that of ISO-8859-1, while still remaining SGML
+ conformant.
+
+ Both formal and actual development of HTML are advancing very fast.
+ The features described in this document are designed so that they can
+ (and should) be added to other forms of HTML besides that described
+ in RFC 1866. Where indicated, attributes introduced here should be
+ extended to the appropriate elements.
+
+1.2 Conformance
+
+ This specification changes slightly the conformance requirements of
+ HTML documents and HTML user agents.
+
+1.2.1 Documents
+
+ All HTML 2.0 conforming documents remain conforming with this
+ specification. However, the extensions introduced here make valid
+ certain documents that would not be HTML 2.0 conforming, in
+ particular those containing characters or character references
+ outside of the repertoire of ISO 8859-1, and those containing markup
+ introduced herein.
+
+1.2.2. User agents
+
+ In addition to the requirements of RFC 1866, the following
+ requirements are placed on HTML user agents.
+
+ To ensure interoperability and proper support for at least ISO-
+ 8859-1 in an environment where character encoding schemes other
+ than ISO-8859-1 are present, user agents MUST correctly interpret
+ the charset parameter accompanying an HTML document received from
+ the network.
+
+ Furthermore, conforming user-agents MUST at least parse correctly
+ all numeric character references within the range of ISO 10646-1
+ [ISO-10646].
+
+ Conforming user-agents are required to apply the BIDI presentation
+ algorithm if they display right-to-left characters. If there is
+ no displayable right-to-left character in a document, there is no
+ need to apply BIDI processing.
+
+
+
+Yergeau, et. al. Standards Track [Page 3]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+2. The document character set
+
+2.1. Reference processing model
+
+ This overview explains a reference processing model used for HTML,
+ and in particular the SGML concept of a document character set. An
+ actual implementation may widely differ in its internal workings from
+ the model given below, but should behave as described to an outside
+ observer.
+
+ Because there are various widely differing encodings of text, SGML
+ does not directly address how the sequence of characters that
+ constitutes an SGML document in the abstract sense are encoded by
+ means of a sequence of octets (or occasionally bit groups of another
+ length than 8) in a concrete realization of the document such as a
+ computer file. This encoding is called the external character
+ encoding of the concrete SGML document, and it should be carefully
+ distinguished from the document character set of the abstract HTML
+ document. SGML views the characters as a single set (called a
+ "character repertoire"), and a "code set" that assigns an integer
+ number (known as "character number") to each character in the
+ repertoire. The document character set declaration defines what each
+ of the character numbers represents [GOLD90, p. 451]. In most cases,
+ an SGML DTD and all documents that refer to it have a single document
+ character set, and all markup and data characters are part of this
+ set.
+
+ HTML, as an application of SGML, does not directly address the
+ question of the external character encoding. This is deferred to
+ mechanisms external to HTML, such as MIME as used by the HTTP
+ protocol or by electronic mail.
+
+ For the HTTP protocol [RFC2068], the external character encoding is
+ indicated by the "charset" parameter of the "Content-Type" field of
+ the header of an HTTP response. For example, to indicate that the
+ transmitted document is encoded in the "JUNET" encoding of Japanese
+ [RFC1468], the header will contain the following line:
+
+ Content-Type: text/html; charset=ISO-2022-JP
+
+ The term "charset" in MIME is used to designate a character encoding,
+ rather than merely a coded character set as the term may suggest. A
+ character encoding is a mapping (possibly many-to-one) of sequences
+ of octets to sequences of characters taken from one or more character
+ repertoires.
+
+ The HTTP protocol also defines a mechanism for the client to specify
+ the character encodings it can accept. Clients and servers are
+
+
+
+Yergeau, et. al. Standards Track [Page 4]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ strongly requested to use these mechanisms to assure correct
+ transmission and interpretation of any document. Provisions that can
+ be taken to help correct interpretation, even in cases where a server
+ or client do not yet use these mechanisms, are described in section
+ 6.
+
+ Similarly, if HTML documents are transferred by electronic mail, the
+ external character encoding is defined by the "charset" parameter of
+ the "Content-Type" MIME header field [RFC2045], and defaults to US-
+ ASCII in its absence.
+
+ No mechanisms are currently standardized for indicating the external
+ character encoding of HTML documents transferred by FTP or accessed
+ in distributed file systems.
+
+ In the case any other way of transferring and storing HTML documents
+ are defined or become popular, it is advised that similar provisions
+ be made to clearly identify the character encoding used and/or to use
+ a single/default encoding capable of representing the widest range of
+ characters used in an international context.
+
+ Whatever the external character encoding may be, the reference
+ processing model translates it to the document character set
+ specified in Section 2.2 before processing specific to SGML/HTML.
+ The reference processing model can be depicted as follows:
+
+ [resource]->[decoder]->[entity ]->[ SGML ]->[application]->[display]
+ [manager] [parser]
+ ^ |
+ | |
+ +----------+
+
+ The decoder is responsible for decoding the external representation
+ of the resource to the document character set. The entity manager,
+ the parser, and the application deal only with characters of the
+ document character set. A display-oriented part of the application
+ or the display machinery itself may again convert characters
+ represented in the document character set to some other
+ representation more suitable for their purpose. In any case, the
+ entity manager, the parser, and the application, as far as character
+ semantics are concerned, are using the HTML document character set
+ only.
+
+ An actual implementation may choose, or not, to translate the
+ document into some encoding of the document character set as
+ described above; the behaviour described by this reference processing
+ model can be achieved otherwise. This subject is well out of the
+ scope of this specification, however, and the reader is invited to
+
+
+
+Yergeau, et. al. Standards Track [Page 5]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ consult the SGML standard [ISO-8879] or an SGML handbook [BRYAN88]
+ [GOLD90] [VANH90] [SQ91] for further information.
+
+ The most important consequence of this reference processing model is
+ that numeric character references are always resolved with respect to
+ the fixed document character set, and thus to the same characters,
+ whatever the external encoding actually used. For an example, see
+ Section 2.2.
+
+2.2. The document character set
+
+ The document character set, in the SGML sense, is the Universal
+ Character Set (UCS) of ISO 10646:1993 [ISO-10646], as amended.
+ Currently, this is code-by-code identical with the Unicode standard,
+ version 1.1 [UNICODE].
+
+ NOTE -- implementers should be aware that ISO 10646 is amended
+ from time to time; 4 amendments have been adopted since the
+ initial 1993 publication, none of which significantly affects this
+ specification. A fifth amendment, now under consideration, will
+ introduce incompatible changes to the standard: 6556 Korean Hangul
+ syllables allocated between code positions 3400 and 4DFF
+ (hexadecimal) will be moved to new positions (and 4516 new
+ syllables added), thus making references to the old positions
+ invalid. Since the Unicode consortium has already adopted a
+ corresponding amendment for inclusion in the forthcoming Unicode
+ 2.0, adoption of DAM 5 is considered likely and implementers
+ should probably consider the old code positions as already
+ invalid. Despite this one-time change, the relevant standard
+ bodies have committed themselves not to change any allocated code
+ position in the future. To encode Korean Hangul irrespective of
+ these changes, the conjoining Hangul Jamo in the range 1110-11F9
+ can be used.
+
+ The adoption of this document character set implies a change in the
+ SGML declaration specified in the HTML 2.0 specification (section 9.5
+ of [RFC1866]). The change amounts to removing the first BASESET
+ specification and its accompanying DESCSET declaration, replacing
+ them with the following declaration:
+
+
+
+
+
+
+
+
+
+
+
+
+Yergeau, et. al. Standards Track [Page 6]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ BASESET "ISO Registration Number 177//CHARSET
+ ISO/IEC 10646-1:1993 UCS-4 with implementation level 3
+ //ESC 2/5 2/15 4/6"
+ DESCSET 0 9 UNUSED
+ 9 2 9
+ 11 2 UNUSED
+ 13 1 13
+ 14 18 UNUSED
+ 32 95 32
+ 127 1 UNUSED
+ 128 32 UNUSED
+ 160 2147483486 160
+
+ Making the UCS the document character set does not create non-
+ conformance of any expression, construct or document that is
+ conforming to HTML 2.0. It does make conforming certain constructs
+ that are not admissible in HTML 2.0. One consequence is that data
+ characters outside the repertoire of ISO-8859-1, but within that of
+ UCS-4 become valid SGML characters. Another is that the upper limit
+ of the range of numeric character references is extended from 255 to
+ 2147483645; thus, И is a valid reference to a "CYRILLIC CAPITAL
+ LETTER I". [ERCS] is a good source of information on Unicode and
+ SGML, although its scope and technical content differ greatly from
+ this specification.
+
+ NOTE -- the above SGML declaration, like that of HTML 2.0,
+ specifies the character numbers 128 to 159 (80 to 9F hex) as
+ UNUSED. This means that numeric character references within that
+ range (e.g. ’) are illegal in HTML. Neither ISO 8859-1 nor
+ ISO 10646 contain characters in that range, which is reserved for
+ control characters.
+
+ Another change was made from the HTML 2.0 SGML declaration, in the
+ belief that the latter did not express its authors' true intent. The
+ syntax character set declaration was changed from ISO 646.IRV:1983 to
+ the newer ISO 646.IRV:1991, the latter, but not the former, being
+ identical with US-ASCII. In principle, this introduces an
+ incompatibility with HTML 2.0, but in practice it should increase
+ interoperability by i) having the SGML declaration say what everyone
+ thinks and ii) making the syntax character set a proper subset of the
+ document character set. The characters that differ between the two
+ versions of ISO 646.IRV are not actually used to express HTML syntax.
+
+ ISO 10646-1:1993 is the most encompassing character set currently
+ existing, and there is no other character set that could take its
+ place as the document character set for HTML. If nevertheless for a
+ specific application there is a need to use characters outside this
+ standard, this should be done by avoiding any conflicts with present
+
+
+
+Yergeau, et. al. Standards Track [Page 7]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ or future versions of ISO 10646, i.e. by assigning these characters
+ to a private zone of the UCS-4 coding space [ISO-10646 section 11].
+ Also, it should be borne in mind that such a use will be highly
+ unportable; in many cases, it may be better to use inline bitmaps.
+
+2.3. Undisplayable characters
+
+ With the document character set being the full ISO 10646, the
+ possibility that a character cannot be displayed due to lack of
+ appropriate resources (fonts) cannot be avoided. Because there are
+ many different things that can be done in such a case, this document
+ does not prescribe any specific behaviour. Depending on the
+ implementation, this may also be handled by the underlaying display
+ system and not the application itself. The following considerations,
+ however, may be of help:
+
+ - A clearly visible, but unobtrusive behaviour should be preferred.
+ Some documents may contain many characters that cannot be
+ rendered, and so showing an alert for each of them is not the
+ right thing to do.
+
+ - In case a numeric representation of the missing character is
+ given, its hexadecimal (not decimal) form is to be preferred,
+ because this form is used in character set standards [ERCS].
+
+3. The LANG attribute
+
+ Language tags can be used to control rendering of a marked up
+ document in various ways: glyph disambiguation, in cases where the
+ character encoding is not sufficient to resolve to a specific glyph;
+ quotation marks; hyphenation; ligatures; spacing; voice synthesis;
+ etc. Independently of rendering issues, language markup is useful as
+ content markup for purposes such as classification and searching.
+
+ Since any text can logically be assigned a language, almost all HTML
+ elements admit the LANG attribute. The DTD reflects this; the only
+ elements in this version of HTML without the LANG attribute are BR,
+ HR, BASE, NEXTID, and META. It is also intended that any new element
+ introduced in later versions of HTML will admit the LANG attribute,
+ unless there is a good reason not to do so.
+
+ The language attribute, LANG, takes as its value a language tag that
+ identifies a natural language spoken, written, or otherwise conveyed
+ by human beings for communication of information to other human
+ beings. Computer languages are explicitly excluded.
+
+
+
+
+
+
+Yergeau, et. al. Standards Track [Page 8]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ The syntax and registry of HTML language tags is the same as that
+ defined by RFC 1766 [RFC1766]. In summary, a language tag is composed
+ of one or more parts: A primary language tag and a possibly empty
+ series of subtags:
+
+ language-tag = primary-tag *( "-" subtag )
+ primary-tag = 1*8ALPHA
+ subtag = 1*8ALPHA
+
+ Whitespace is not allowed within the tag and all tags are case-
+ insensitive. The namespace of language tags is administered by the
+ IANA. Example tags include:
+
+ en, en-US, en-cockney, i-cherokee, x-pig-latin
+
+ In the context of HTML, a language tag is not to be interpreted as a
+ single token, as per RFC 1766, but as a hierarchy. For example, a
+ user agent that adjusts rendering according to language should
+ consider that it has a match when a language tag in a style sheet
+ entry matches the initial portion of the language tag of an element.
+ An exact match should be preferred. This interpretation allows an
+ element marked up as, for instance, "en-US" to trigger styles
+ corresponding to, in order of preference, US-English ("en-US") or
+ 'plain' or 'international' English ("en").
+
+ NOTE -- using the language tag as a hierarchy does not imply that
+ all languages with a common prefix will be understood by those
+ fluent in one or more of those languages; it simply allows the
+ user to request this commonality when it is true for that user.
+
+ The rendering of elements may be affected by the LANG attribute. For
+ any element, the value of the LANG attribute overrides the value
+ specified by the LANG attribute of any enclosing element and the
+ value (if any) of the HTTP Content-Language header. If none of these
+ are set, a suitable default, perhaps controlled by user preferences,
+ by automatic context analysis or by the user's locale, should be used
+ to control rendering.
+
+4. Additional entities, attributes and elements
+
+4.1. Full Latin-1 entity set
+
+ According to the suggestion of section 14 of [RFC1866], the set of
+ Latin-1 entities is extended to cover the whole right part of ISO-
+ 8859-1 (all code positions with the high-order bit set), including
+ the already commonly used  , © and ®. The names of the
+ entities are taken from the appendices of SGML [ISO-8879]. A list is
+ provided in section 7.3 of this specification.
+
+
+
+Yergeau, et. al. Standards Track [Page 9]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+4.2. Markup for language-dependent presentation
+
+4.2.1. Overview
+
+ For the correct presentation of text in certain languages
+ (irrespective of formatting issues), some support in the form of
+ additional entities and elements is needed.
+
+ In particular, the following features are dealt with:
+
+ - Markup of bidirectional text, i.e. text where left-to-right and
+ right-to-left scripts are mixed.
+
+ - Control of cursive joining behaviour in contexts where the
+ default behaviour is not appropriate.
+
+ - Language-dependent rendering of short (in-line) quotations.
+
+ - Better justification control for languages where this is
+ important.
+
+ - Superscripts and subscripts for languages where they appear as
+ part of general text.
+
+ Some of the above features need very little additional support;
+ others need more. The additional features are introduced below with
+ brief comments only. Explanations on cursive joining behaviour and
+ bidirectional text follow later. For cursive joining behaviour and
+ bidirectional text, this document follows [UNICODE] in that: i)
+ character semantics, where applicable, are identical to [UNICODE],
+ and ii) where functionality is moved to HTML as a higher level
+ protocol, this is done in a way that allows straightforward
+ conversion to the lower-level mechanisms defined in [UNICODE].
+
+4.2.2. List of entities, elements, and attributes
+
+ First, a generic container is needed to carry the LANG and DIR (see
+ below) attributes in cases where no other element is appropriate; the
+ SPAN element is introduced for that purpose.
+
+
+
+
+
+
+
+
+
+
+
+
+Yergeau, et. al. Standards Track [Page 10]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ A set of named character entities is added for use with bidirectional
+ rendering and cursive joining control:
+
+ <!ENTITY zwnj CDATA "&#8204;"--=zero width non-joiner-->
+ <!ENTITY zwj CDATA "&#8205;"--=zero width joiner-->
+ <!ENTITY lrm CDATA "&#8206;"--=left-to-right mark-->
+ <!ENTITY rlm CDATA "&#8207;"--=right-to-left mark-->
+
+ These entities can be used in place of the corresponding formatting
+ characters whenever convenient, for example to ease keyboard entry or
+ when a formatting character is not available in the character
+ encoding of the document.
+
+ Next, an attribute called DIR is introduced, restricted to the values
+ LTR (left-to-right) and RTL (right-to-left), for the indication of
+ directionality in the context of bidirectional text (see 4.2.4 below
+ for details). Since any text and many other elements (e.g. tables)
+ can logically be assigned a directionality, all elements except BR,
+ HR, BASE, NEXTID, and META admit this attribute. The DTD reflects
+ this. It is also intended that any new element introduced in later
+ versions of HTML will admit the DIR attribute, unless there is a good
+ reason not to do so.
+
+ A new phrase-level element called BDO (BIDI Override) is introduced,
+ which requires the DIR attribute to specify whether the override is
+ left-to-right or right-to-left. This element is required for
+ bidirectional text control; for detailed explanations, see section
+ 4.2.4.
+
+ The phrase-level element Q is introduced to allow language-dependent
+ rendering of short quotations depending on language and platform
+ capability. As the following examples show (rather poorly, because of
+ the character set restriction of Internet specifications), the
+ quotation marks surrounding the quotation are particularly affected:
+ "a quotation in English", `another, slightly better one', ,,a
+ quotation in German'', << a quotation in French >>. The contents of
+ the Q element does not include quotation marks, which have to be
+ added by the rendering process.
+
+ NOTE -- Q elements can be nested. Many languages use different
+ quotation styles for outer and inner quotations, and this should
+ be respected by user-agents implementing this element.
+
+
+
+
+
+
+
+
+
+Yergeau, et. al. Standards Track [Page 11]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ NOTE -- minimal support for the Q element is to surround the
+ contents with some kind of quotes, like the plain ASCII double
+ quotes. As this is rather easy to implement, and as the lack of
+ any visible quotes may affect the perceived meaning of the text,
+ user-agent implementors are strongly requested to provide at least
+ this minimal level of support.
+
+ Many languages require superscript text for proper rendering: as an
+ example, the French "Mlle Dupont" should have "lle" in superscript.
+ The SUP element, and its sibling SUB for subscript text, are
+ introduced to allow proper markup of such text. SUP and SUB contents
+ are restricted to PCDATA to avoid nesting problems.
+
+ Finally, in many languages text justification is much more important
+ than it is in Western languages, and justifies markup. The ALIGN
+ attribute, admitting values of LEFT, RIGHT, CENTER and JUSTIFY, is
+ added to a selection of elements where it makes sense (the block-like
+ P, HR, H1 to H6, OL, UL, DIR, MENU, LI, BLOCKQUOTE and ADDRESS). If
+ a user-agent chooses to have LEFT as a default for blocks of left-
+ to-right directionality, it should use RIGHT for blocks of right-to-
+ left directionality.
+
+ NOTE -- RFC 1866 section 4.2.2 specifies that an HTML user agent
+ should treat an end of line as a word space, except in
+ preformatted text. This should be interpreted in the context of
+ the script being processed, as the way words are separated in
+ writing is script-dependent. For some scripts (e.g. Latin), a
+ word space is just a space, but in other scripts (e.g. Thai) it is
+ a zero-width word separator, whereas in yet other scripts (e.g.
+ Japanese) it is nothing at all, i.e. totally ignored.
+
+ NOTE -- the SOFT HYPHEN character (U+00AD) needs special attention
+ from user-agent implementers. It is present in many character
+ sets (including the whole ISO 8859 series and, of course, ISO
+ 10646), and can always be included by means of the reference
+ &shy;. Its semantics are different from the plain HYPHEN: it
+ indicates a point in a word where a line break is allowed. If the
+ line is indeed broken there, a hyphen must be displayed at the end
+ of the first line. If not, the character is not dispalyed at all.
+ In operations like searching and sorting, it must always be
+ ignored.
+
+
+
+
+
+
+
+
+
+
+Yergeau, et. al. Standards Track [Page 12]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ In the DTD, the LANG and DIR attributes are grouped together in a
+ parameter entity called attrs. To parallel RFC 1942 [RFC1942], the
+ ID and CLASS attributes are also included in attrs. The ID and CLASS
+ attributes are required for use with style sheets, and RFC 1942
+ defines them as follows:
+
+ID Used to define a document-wide identifier. This can be used
+ for naming positions within documents as the destination of a
+ hypertext link. It may also be used by style sheets for
+ rendering an element in a unique style. An ID attribute value is
+ an SGML NAME token. NAME tokens are formed by an initial
+ letter followed by letters, digits, "-" and "." characters. The
+ letters are restricted to A-Z and a-z.
+
+CLASS A space separated list of SGML NAME tokens. CLASS names
+ specify that the element belongs to the corresponding named
+ classes. It allows authors to distinguish different roles
+ played by the same tag. The classes may be used by style
+ sheets to provide different renderings as appropriate to
+ these roles.
+
+4.2.3. Cursive joining behaviour
+
+ Markup is needed in some cases to force cursive joining behavior in
+ contexts in which it would not normally occur, or to block it when it
+ would normally occur.
+
+ The zero-width joiner and non-joiner (&zwj; and &zwnj;) are used to
+ control cursive joining behaviour. For example, ARABIC LETTER HEH is
+ used in isolation to abbreviate "Hijri" (the Islamic calendrical
+ system); however, the initial form of the letter is desired, because
+ the isolated form of HEH looks like the digit five as employed in
+ Arabic script. This is obtained by following the HEH with a zero-
+ width joiner whose only effect is to provide context. In Persian
+ texts, there are cases where a letter that normally would join a
+ subsequent letter in a cursive connection does not. Here a zero-
+ width non- joiner is used.
+
+4.2.4. Bidirectional text
+
+ Many languages are written in horizontal lines from left to right,
+ while others are written from right to left. When both writing
+ directions are present, one talks of bidirectional text (BIDI for
+ short). BIDI text requires markup in special circumstances where
+ ambiguities as to the directionality of some characters have to be
+ resolved. This markup affects the ability to render BIDI text in a
+ semantically legible fashion. That is, without this special BIDI
+ markup, cases arise which would prevent *any* rendering whatsoever
+
+
+
+Yergeau, et. al. Standards Track [Page 13]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ that reflected the basic meaning of the text. Plain text may contain
+ BIDI markup in the form of special-purpose formatting characters.
+
+ This is also possible in HTML, which includes the five BIDI-related
+ formatting characters (202A - 202E) of ISO 10646. As an alternative,
+ HTML provides equivalent SGML markup.
+
+ BIDI is a complex issue, and conversion of logical text sequences to
+ display sequences has to be done according to the algorithm and
+ character properties specified in [UNICODE]. Here, explanations are
+ given only as far as they are needed to understand the necessity of
+ the features introduced and to define their exact semantics.
+
+ The Unicode BIDI algorithm is based on the individual characters of a
+ text being stored in logical order, that is the order in which they
+ are normally input and in which the corresponding sounds are normally
+ spoken. To make rendering of logical order text possible, the
+ algorithm assigns a directionality property to each character, e.g.
+ Latin letters are specified to have a left-to-right direction, Arabic
+ and Hebrew characters have a right-to-left direction.
+
+ The left-to-right and right-to-left marks (&lrm; and &rlm;) are used
+ to disambiguate directionality of neutral characters. For example,
+ when a double quote sits between an Arabic and a Latin letter, its
+ direction is ambiguous; if a directional mark is added on one side
+ such that the quotation mark is surrounded by characters of only one
+ directionality, the ambiguity is removed. These characters are like
+ zero width spaces which have a directional property (but no word/line
+ break property).
+
+ Nested embeddings of contra-directional text runs, due to nested
+ quotations or to the pasting of text from one BIDI context to
+ another, is also a case where the implicit directionality of
+ characters is not sufficient, requiring markup. Also, it is
+ frequently desirable to specify the basic directionality of a block
+ of text. For these purposes, the DIR attribute is used.
+
+ On block-type elements, the DIR attribute indicates the base
+ directionality of the text in the block; if omitted it is inherited
+ from the parent element. The default directionality of the overall
+ HTML document is left-to-right.
+
+ On inline elements, it makes the element start a new embedding level
+ (to be explained below); if omitted the inline element does not start
+ a new embedding level.
+
+
+
+
+
+
+Yergeau, et. al. Standards Track [Page 14]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ NOTE -- the PRE, XMP and LISTING elements admit the DIR attribute.
+ Their contents should not be considered as preformatted with
+ respect to bidirectional layout, but the BIDI algorithm should be
+ applied to each line of text.
+
+ Following is an example of a case where embedding is needed, showing
+ its effect:
+
+ Given the following latin (upper case) and arabic (lower case)
+ letters in backing store with the specified embeddings:
+
+ <SPAN DIR=LTR> AB <SPAN DIR=RTL> xy <SPAN DIR=LTR> CD </SPAN> zw
+ </SPAN> EF </SPAN>
+
+ One gets the following rendering (with [] showing the directional
+ transitions):
+
+ [ AB [ wz [ CD ] yx ] EF ]
+
+ On the other hand, without this markup and with a base direction
+ of LTR one gets the following rendering:
+
+ [ AB [ yx ] CD [ wz ] EF ]
+
+ Notice that yx is on the left and wz on the right unlike the above
+ case where the embedding levels are used. Without the embedding
+ markup one has at most two levels: a base directional level and a
+ single counterflow directional level.
+
+ The DIR attribute on inline elements is equivalent to the formatting
+ characters LEFT-TO-RIGHT EMBEDDING (202A) and RIGHT-TO-LEFT
+ EMBEDDING (202B) of ISO 10646. The end tag of the element is
+ equivalent to the POP DIRECTIONAL FORMATTING (202C) character.
+
+ Directional override, as provided by the BDO element, is needed to
+ deal with unusual short pieces of text in which directionality cannot
+ be resolved from context in an unambiguous fashion. For example, it
+ can be used to force left-to-right (or right-to-left) display of part
+ numbers composed of Latin letters, digits and Hebrew letters.
+
+ The effect of BDO is to force the directionality of all characters
+ within it to the value of DIR, irrespective of their intrinsic
+ directional properties. It is equivalent to using the LEFT-TO-RIGHT
+ OVERRIDE (202D) or RIGHT-TO-LEFT OVERRIDE (202E) characters of ISO
+ 10646, the end tag again being equivalent to the POP DIRECTIONAL
+ FORMATTING (202C) character.
+
+
+
+
+
+Yergeau, et. al. Standards Track [Page 15]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ NOTE -- authors and authoring software writers should be aware
+ that conflicts can arise if the DIR attribute is used on inline
+ elements (including BDO) concurrently with the use of the
+ corresponding ISO 10646 formatting characters.
+
+ Preferably one or the other should be used exclusively; the markup
+ method is better able to guarantee document structural integrity,
+ and alleviates some problems when editing bidirectional HTML text
+ with a simple text editor, but some software may be more apt at
+ using the 10646 characters. If both methods are used, great care
+ should be exercised to insure proper nesting of markup and
+ directional embedding or override; otherwise, rendering results
+ are undefined.
+
+5. Forms
+
+5.1. DTD additions
+
+ It is natural to expect input in any language in forms, as they
+ provide one of the only ways of obtaining user input. While this is
+ primarily a UI issue, there are some things that should be specified
+ at the HTML level to guide behavior and promote interoperability.
+
+ To ensure full interoperability, it is necessary for the user agent
+ (and the user) to have an indication of the character encoding(s)
+ that the server providing a form will be able to handle upon
+ submission of the filled-in form. Such an indication is provided by
+ the ACCEPT-CHARSET attribute of the INPUT and TEXTAREA elements,
+ modeled on the HTTP Accept-Charset header (see [HTTP-1.1]), which
+ contains a space and/or comma delimited list of character sets
+ acceptable to the server. A user agent may want to somehow advise
+ the user of the contents of this attribute, or to restrict his
+ possibility to enter characters outside the repertoires of the listed
+ character sets.
+
+ NOTE -- The list of character sets is to be interpreted as an
+ EXCLUSIVE-OR list; the server announces that it is ready to accept
+ any ONE of these character encoding schemes for each part of a
+ multipart entity. The client may perform character encoding
+ translation to satisfy the server if necessary.
+
+ NOTE -- The default value for the ACCEPT-CHARSET attribute of an
+ INPUT or TEXTAREA element is the reserved value "UNKNOWN". A user
+ agent may interpret that value as the character encoding scheme
+ that was used to transmit the document containing that element.
+
+
+
+
+
+
+Yergeau, et. al. Standards Track [Page 16]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+5.2. Form submission
+
+ The HTML 2.0 form submission mechanism, based on the "application/x-
+ www-form-urlencoded" media type, is ill-equipped with regard to
+ internationalization. In fact, since URLs are restricted to ASCII
+ characters, the mechanism is akward even for ISO-8859-1 text.
+ Section 2.2 of [RFC1738] specifies that octets may be encoded using
+ the "%HH" notation, but text submitted from a form is composed of
+ characters, not octets. Lacking a specification of a character
+ encoding scheme, the "%HH" notation has no well-defined meaning.
+
+ The best solution is to use the "multipart/form-data" media type
+ described in [RFC1867] with the POST method of form submission. This
+ mechanism encapsulates the value part of each name-value pair in a
+ body-part of a multipart MIME body that is sent as the HTTP entity;
+ each body part can be labeled with an appropriate Content-Type,
+ including if necessary a charset parameter that specifies the
+ character encoding scheme. The changes to the DTD necessary to
+ support this method of form submission have been incorporated in the
+ DTD included in this specification.
+
+ A less satisfactory solution is to add a MIME charset parameter to
+ the "application/x-www-form-urlencoded" media type specifier sent
+ along with a POST method form submission, with the understanding that
+ the URL encoding of [RFC1738] is applied on top of the specified
+ character encoding, as a kind of implicit Content-Transfer-Encoding.
+
+ One problem with both solutions above is that current browsers do not
+ generally allow for bookmarks to specify the POST method; this should
+ be improved. Conversely, the GET method could be used with the form
+ data transmitted in the body instead of in the URL. Nothing in the
+ protocol seems to prevent it, but no implementations appear to exist
+ at present.
+
+ How the user agent determines the encoding of the text entered by the
+ user is outside the scope of this specification.
+
+ NOTE -- Designers of forms and their handling scripts should be
+ aware of an important caveat: when the default value of a field
+ (the VALUE attribute) is returned upon form submission (i.e. the
+ user did not modify this value), it cannot be guaranteed to be
+ transmitted as a sequence of octets identical to that in the
+ source document -- only as a possibly different but valid encoding
+ of the same sequence of text elements. This may be true even if
+ the encoding of the document containing the form and that used for
+ submission are the same.
+
+
+
+
+
+Yergeau, et. al. Standards Track [Page 17]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ Differences can occur when a sequence of characters can be
+ represented by various sequences of octets, and also when a
+ composite sequence (a base character plus one or more combining
+ diacritics) can be represented by either a different but
+ equivalent composite sequence or by a fully precomposed character.
+ For instance, the UCS-2 sequence 00EA+0323 (LATIN SMALL LETTER E
+ WITH CIRCUMFLEX ACCENT + COMBINING DOT BELOW) may be transformed
+ into 1EC7 (LATIN SMALL LETTER E WITH CIRCUMFLEX ACCENT AND DOT
+ BELOW), into 0065+0302+0323 (LATIN SMALL LETTER E + COMBINING
+ CIRCUMFLEX ACCENT + COMBINING DOT BELOW), as well as into other
+ equivalent composite sequences.
+
+6. External character encoding issues
+
+ Proper interpretation of a text document requires that the character
+ encoding scheme be known. Current HTTP servers, however, do not
+ generally include an appropriate charset parameter with the Content-
+ Type header. This is bad behaviour, which is even encouraged by the
+ continued existence of browsers that declare an unrecognized media
+ type when they receive a charset parameter. User agent
+ implementators are strongly encouraged to make their software
+ tolerant of this parameter, even if they cannot take advantage of it.
+ Proper labelling is highly desirable, but some preventive measures
+ can be taken to minimize the detrimental effects of its absence:
+
+ In the case where a document is accessed from a hyperlink in an
+ origin HTML document, a CHARSET attribute is added to the attribute
+ list of elements with link semantics (A and LINK), specifically by
+ adding it to the linkExtraAttributes entity. The value of that
+ attribute is to be considered a hint to the User Agent as to the
+ character encoding scheme used by the resource pointed to by the
+ hyperlink; it should be the appropriate value of the MIME charset
+ parameter for that resource.
+
+ In any document, it is possible to include an indication of the
+ encoding scheme like the following, as early as possible within the
+ HEAD of the document:
+
+ <META HTTP-EQUIV="Content-Type"
+ CONTENT="text/html; charset=ISO-2022-JP">
+
+ This is not foolproof, but will work if the encoding scheme is such
+ that ASCII-valued octets stand for ASCII characters only at least
+ until the META element is parsed. Note that there are better ways
+ for a server to obtain character encoding information, instead of the
+ unreliable META above; see [NICOL2] for some details and a proposal.
+
+
+
+
+
+Yergeau, et. al. Standards Track [Page 18]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ For definiteness, the "charset" parameter received from the source of
+ the document should be considered the most authoritative, followed in
+ order of preference by the contents of a META element such as the
+ above, and finally the CHARSET parameter of the anchor that was
+ followed (if any).
+
+ When HTML text is transmitted directly in UCS-2 or UCS-4 form, the
+ question of byte order arises: does the high-order byte of each
+ multi-byte character come first or last? For definiteness, this
+ specification recommends that UCS-2 and UCS-4 be transmitted in big-
+ endian byte order (high order byte first), which corresponds to the
+ established network byte order for two- and four-byte quantities, to
+ the ISO 10646 requirement and Unicode recommendation for serialized
+ text data and to RFC 1641. Furthermore, to maximize chances of
+ proper interpretation, it is recommended that documents transmitted
+ as UCS-2 or UCS-4 always begin with a ZERO-WIDTH NON-BREAKING SPACE
+ character (hexadecimal FEFF or 0000FEFF) which, when byte-reversed
+ becomes number FFFE or FFFE0000, a character guaranteed to be never
+ assigned. Thus, a user-agent receiving an FFFE as the first octets
+ of a text would know that bytes have to be reversed for the remainder
+ of the text.
+
+ There exist so-called UCS Transformation Formats than can be used to
+ transmit UCS data, in addition to UCS-2 and UCS-4. UTF-7 [RFC1642]
+ and UTF-8 [UTF-8] have favorable properties (no byte-ordering
+ problem, different flavours of ASCII compatibility) that make them
+ worthy of consideration, especially for transmission of multilingual
+ text. Another encoding scheme, MNEM [RFC1345], also has interesting
+ properties and the capability to transmit the full UCS. The UTF-1
+ transformation format of ISO 10646:1993 (registered by IANA as ISO-
+ 10646-UTF-1), has been removed from ISO 10646 by amendment 4, and
+ should not be used.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Yergeau, et. al. Standards Track [Page 19]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+7. HTML Public Text
+
+7.1. HTML DTD
+
+ This section contains a DTD for HTML based on the HTML 2.0 DTD of RFC
+ 1866, incorporating the changes for file upload as specified in RFC
+ 1867, and the changes deriving from this document.
+
+ <!-- html.dtd
+
+ Document Type Definition for the HyperText Markup Language,
+ extended for internationalisation (HTML DTD)
+
+ Last revised: 96/08/07
+
+ Authors: Daniel W. Connolly <connolly@w3.org>
+ Francois Yergeau <yergeau@alis.com>
+ See Also:
+ http://www.w3.org/hypertext/WWW/MarkUp/MarkUp.html
+ -->
+
+ <!ENTITY % HTML.Version
+ "-//IETF//DTD HTML i18n//EN"
+
+ -- Typical usage:
+
+ <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML i18n//EN">
+ <html>
+ ...
+ </html>
+ --
+ >
+
+
+ <!--============ Feature Test Entities ========================-->
+
+ <!ENTITY % HTML.Recommended "IGNORE"
+ -- Certain features of the language are necessary for
+ compatibility with widespread usage, but they may
+ compromise the structural integrity of a document.
+ This feature test entity enables a more prescriptive
+ document type definition that eliminates
+ those features.
+ -->
+
+ <![ %HTML.Recommended [
+ <!ENTITY % HTML.Deprecated "IGNORE">
+ ]]>
+
+
+
+Yergeau, et. al. Standards Track [Page 20]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ <!ENTITY % HTML.Deprecated "INCLUDE"
+ -- Certain features of the language are necessary for
+ compatibility with earlier versions of the specification,
+ but they tend to be used and implemented inconsistently,
+ and their use is deprecated. This feature test entity
+ enables a document type definition that eliminates
+ these features.
+ -->
+
+ <!ENTITY % HTML.Highlighting "INCLUDE"
+ -- Use this feature test entity to validate that a
+ document uses no highlighting tags, which may be
+ ignored on minimal implementations.
+ -->
+
+ <!ENTITY % HTML.Forms "INCLUDE"
+ -- Use this feature test entity to validate that a document
+ contains no forms, which may not be supported in minimal
+ implementations
+ -->
+
+ <!--============== Imported Names ==============================-->
+
+ <!ENTITY % Content-Type "CDATA"
+ -- meaning an internet media type
+ (aka MIME content type, as per RFC2045)
+ -->
+
+ <!ENTITY % HTTP-Method "GET | POST"
+ -- as per HTTP specification, RFC2068
+ -->
+
+ <!--========= DTD "Macros" =====================-->
+
+ <!ENTITY % heading "H1|H2|H3|H4|H5|H6">
+
+ <!ENTITY % list " UL | OL | DIR | MENU " >
+
+ <!ENTITY % attrs -- common attributes for elements --
+ "LANG NAME #IMPLIED -- RFC 1766 language tag --
+ DIR (ltr|rtl) #IMPLIED -- text directionnality --
+ ID ID #IMPLIED -- element identifier
+ (from RFC1942) --
+ CLASS NAMES #IMPLIED -- for subclassing elements
+ (from RFC1942) --">
+
+ <!ENTITY % just -- an attribute for text justification --
+ "ALIGN (left|right|center|justify) #IMPLIED"
+
+
+
+Yergeau, et. al. Standards Track [Page 21]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ -- default is left for ltr paragraphs, right for rtl -- >
+
+ <!--======= Character mnemonic entities =================-->
+
+ <!ENTITY % ISOlat1 PUBLIC
+ "ISO 8879-1986//ENTITIES Added Latin 1//EN//HTML">
+ %ISOlat1;
+
+ <!ENTITY amp CDATA "&#38;" -- ampersand -->
+ <!ENTITY gt CDATA "&#62;" -- greater than -->
+ <!ENTITY lt CDATA "&#60;" -- less than -->
+ <!ENTITY quot CDATA "&#34;" -- double quote -->
+
+ <!--Entities for language-dependent presentation (BIDI and
+ contextual analysis) -->
+ <!ENTITY zwnj CDATA "&#8204;"-- zero width non-joiner-->
+ <!ENTITY zwj CDATA "&#8205;"-- zero width joiner-->
+ <!ENTITY lrm CDATA "&#8206;"-- left-to-right mark-->
+ <!ENTITY rlm CDATA "&#8207;"-- right-to-left mark-->
+
+
+ <!--========= SGML Document Access (SDA) Parameter Entities =====-->
+
+ <!-- HTML contains SGML Document Access (SDA) fixed attributes
+ in support of easy transformation to the International Committee
+ for Accessible Document Design (ICADD) DTD
+ "-//EC-USA-CDA/ICADD//DTD ICADD22//EN".
+ ICADD applications are designed to support usable access to
+ structured information by print-impaired individuals through
+ Braille, large print and voice synthesis. For more information on
+ SDA & ICADD:
+ - ISO 12083:1993, Annex A.8, Facilities for Braille,
+ large print and computer voice
+ - ICADD ListServ
+ <ICADD%ASUACAD.BITNET@ARIZVM1.ccit.arizona.edu>
+ - Usenet news group bit.listserv.easi
+ - Recording for the Blind, +1 800 221 4792
+ -->
+
+ <!ENTITY % SDAFORM "SDAFORM CDATA #FIXED"
+ -- one to one mapping -->
+ <!ENTITY % SDARULE "SDARULE CDATA #FIXED"
+ -- context-sensitive mapping -->
+ <!ENTITY % SDAPREF "SDAPREF CDATA #FIXED"
+ -- generated text prefix -->
+ <!ENTITY % SDASUFF "SDASUFF CDATA #FIXED"
+ -- generated text suffix -->
+ <!ENTITY % SDASUSP "SDASUSP NAME #FIXED"
+
+
+
+Yergeau, et. al. Standards Track [Page 22]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ -- suspend transform process -->
+
+
+ <!--========== Text Markup =====================-->
+
+ <![ %HTML.Highlighting [
+
+ <!ENTITY % font " TT | B | I ">
+
+ <!ENTITY % phrase "EM | STRONG | CODE | SAMP | KBD | VAR | CITE ">
+
+ <!ENTITY % text "#PCDATA|A|IMG|BR|%phrase|%font|SPAN|Q|BDO|SUP|SUB">
+
+ <!ELEMENT (%font;|%phrase) - - (%text)*>
+ <!ATTLIST ( TT | CODE | SAMP | KBD | VAR )
+ %attrs;
+ %SDAFORM; "Lit"
+ >
+
+ <!ATTLIST ( B | STRONG )
+ %attrs;
+ %SDAFORM; "B"
+ >
+ <!ATTLIST ( I | EM | CITE )
+ %attrs;
+ %SDAFORM; "It"
+ >
+
+ <!-- <TT> Typewriter text -->
+ <!-- <B> Bold text -->
+ <!-- <I> Italic text -->
+
+ <!-- <EM> Emphasized phrase -->
+ <!-- <STRONG> Strong emphasis -->
+ <!-- <CODE> Source code phrase -->
+ <!-- <SAMP> Sample text or characters -->
+ <!-- <KBD> Keyboard phrase, e.g. user input -->
+ <!-- <VAR> Variable phrase or substitutable -->
+ <!-- <CITE> Name or title of cited work -->
+
+ <!ENTITY % pre.content "#PCDATA|A|HR|BR|%font|%phrase|SPAN|BDO">
+
+ ]]>
+
+ <!ENTITY % text "#PCDATA|A|IMG|BR|SPAN|Q|BDO|SUP|SUB">
+
+ <!ELEMENT BR - O EMPTY>
+ <!ATTLIST BR
+
+
+
+Yergeau, et. al. Standards Track [Page 23]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ %SDAPREF; "&#RE;"
+ >
+
+ <!-- <BR> Line break -->
+
+ <!ELEMENT SPAN - - (%text)*>
+ <!ATTLIST SPAN
+ %attrs;
+ %SDAFORM; "other #Attlist"
+ >
+
+ <!-- <SPAN> Generic inline container -->
+ <!-- <SPAN DIR=...> New counterflow embedding -->
+ <!-- <SPAN LANG="..."> Language of contents -->
+
+ <!ELEMENT Q - - (%text)*>
+ <!ATTLIST Q
+ %attrs;
+ %SDAPREF; '"'
+ %SDASUFF; '"'
+ >
+
+ <!-- <Q> Short quotation -->
+ <!-- <Q LANG=xx> Language of quotation is xx -->
+ <!-- <Q DIR=...> New conterflow embedding -->
+
+ <!ELEMENT BDO - - (%text)+>
+ <!ATTLIST BDO
+ LANG NAME #IMPLIED
+ DIR (ltr|rtl) #REQUIRED
+ ID ID #IMPLIED
+ CLASS NAMES #IMPLIED
+ %SDAPREF "Bidi Override #Attval(DIR): "
+ %SDASUFF "End Bidi"
+ >
+
+ <!-- <BDO DIR=...> Override directionality of text to value of DIR -->
+ <!-- <BDO LANG=...> Language of contents -->
+
+ <!ELEMENT (SUP|SUB) - - (#PCDATA)>
+ <!ATTLIST (SUP)
+ %attrs;
+ %SDAPREF "Superscript(#content)"
+ >
+ <!ATTLIST (SUB)
+ %attrs;
+ %SDAPREF "Subscript(#content)"
+ >
+
+
+
+Yergeau, et. al. Standards Track [Page 24]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ <!-- <SUP> Superscript -->
+ <!-- <SUB> Subscript -->
+
+ <!--========= Link Markup ======================-->
+
+ <!ENTITY % linkType "NAMES">
+
+ <!ENTITY % linkExtraAttributes
+ "REL %linkType #IMPLIED
+ REV %linkType #IMPLIED
+ URN CDATA #IMPLIED
+ TITLE CDATA #IMPLIED
+ METHODS NAMES #IMPLIED
+ CHARSET NAME #IMPLIED
+ ">
+
+ <![ %HTML.Recommended [
+ <!ENTITY % A.content "(%text)*"
+
+ -- <H1><a name="xxx">Heading</a></H1>
+ is preferred to
+ <a name="xxx"><H1>Heading</H1></a>
+ -->
+ ]]>
+
+ <!ENTITY % A.content "(%heading|%text)*">
+
+ <!ELEMENT A - - %A.content -(A)>
+ <!ATTLIST A
+ %attrs;
+ HREF CDATA #IMPLIED
+ NAME CDATA #IMPLIED
+ %linkExtraAttributes;
+ %SDAPREF; "<Anchor: #AttList>"
+ >
+ <!-- <A> Anchor; source/destination of link -->
+ <!-- <A NAME="..."> Name of this anchor -->
+ <!-- <A HREF="..."> Address of link destination -->
+ <!-- <A URN="..."> Permanent address of destination -->
+ <!-- <A REL=...> Relationship to destination -->
+ <!-- <A REV=...> Relationship of destination to this -->
+ <!-- <A TITLE="..."> Title of destination (advisory) -->
+ <!-- <A METHODS="..."> Operations on destination (advisory) -->
+ <!-- <A CHARSET="..."> Charset of destination (advisory) -->
+ <!-- <A LANG="..."> Language of contents btw <A> and </A> -->
+ <!-- <A DIR=...> Contents is a new counterflow embedding -->
+
+ <!--========== Images ==========================-->
+
+
+
+Yergeau, et. al. Standards Track [Page 25]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ <!ELEMENT IMG - O EMPTY>
+ <!ATTLIST IMG
+ %attrs;
+ SRC CDATA #REQUIRED
+ ALT CDATA #IMPLIED
+ ALIGN (top|middle|bottom) #IMPLIED
+ ISMAP (ISMAP) #IMPLIED
+ %SDAPREF; "<Fig><?SDATrans Img: #AttList>#AttVal(Alt)</Fig>"
+ >
+
+ <!-- <IMG> Image; icon, glyph or illustration -->
+ <!-- <IMG SRC="..."> Address of image object -->
+ <!-- <IMG ALT="..."> Textual alternative -->
+ <!-- <IMG ALIGN=...> Position relative to text -->
+ <!-- <IMG LANG=...> Image contains "text" in that language -->
+ <!-- <IMG DIR=...> Inline image acts as a RTL or LTR
+ embedding w/r to BIDI algorithm -->
+ <!-- <IMG ISMAP> Each pixel can be a link -->
+
+ <!--========== Paragraphs=======================-->
+
+ <!ELEMENT P - O (%text)*>
+ <!ATTLIST P
+ %attrs;
+ %just;
+ %SDAFORM; "Para"
+ >
+
+ <!-- <P> Paragraph -->
+ <!-- <P LANG="..."> Language of paragraph text -->
+ <!-- <P DIR=...> Base directionality of paragraph -->
+ <!-- <P ALIGN=...> Paragraph alignment (justification) -->
+
+ <!--========== Headings, Titles, Sections ===============-->
+
+ <!ELEMENT HR - O EMPTY>
+ <!ATTLIST HR
+ %just;
+ %SDAPREF; "&#RE;&#RE;"
+ >
+
+ <!-- <HR> Horizontal rule -->
+
+ <!ELEMENT ( %heading ) - - (%text;)*>
+ <!ATTLIST H1
+ %attrs;
+ %just;
+ %SDAFORM; "H1"
+
+
+
+Yergeau, et. al. Standards Track [Page 26]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ >
+ <!ATTLIST H2
+ %attrs;
+ %just;
+ %SDAFORM; "H2"
+ >
+ <!ATTLIST H3
+ %attrs;
+ %just;
+ %SDAFORM; "H3"
+ >
+ <!ATTLIST H4
+ %attrs;
+ %just;
+ %SDAFORM; "H4"
+ >
+ <!ATTLIST H5
+ %attrs;
+ %just;
+ %SDAFORM; "H5"
+ >
+ <!ATTLIST H6
+ %attrs;
+ %just;
+ %SDAFORM; "H6"
+ >
+
+ <!-- <H1> Heading, level 1 -->
+ <!-- <H2> Heading, level 2 -->
+ <!-- <H3> Heading, level 3 -->
+ <!-- <H4> Heading, level 4 -->
+ <!-- <H5> Heading, level 5 -->
+ <!-- <H6> Heading, level 6 -->
+
+
+ <!--========== Text Flows ======================-->
+
+ <![ %HTML.Forms [
+ <!ENTITY % block.forms "BLOCKQUOTE | FORM | ISINDEX">
+ ]]>
+
+ <!ENTITY % block.forms "BLOCKQUOTE">
+
+ <![ %HTML.Deprecated [
+ <!ENTITY % preformatted "PRE | XMP | LISTING">
+ ]]>
+
+ <!ENTITY % preformatted "PRE">
+
+
+
+Yergeau, et. al. Standards Track [Page 27]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ <!ENTITY % block "P | %list | DL
+ | %preformatted
+ | %block.forms">
+
+ <!ENTITY % flow "(%text|%block)*">
+
+ <!ENTITY % pre.content "#PCDATA | A | HR | BR | SPAN | BDO">
+ <!ELEMENT PRE - - (%pre.content)*>
+ <!ATTLIST PRE
+ %attrs;
+ WIDTH NUMBER #implied
+ %SDAFORM; "Lit"
+ >
+
+ <!-- <PRE> Preformatted text -->
+ <!-- <PRE WIDTH=...> Maximum characters per line -->
+ <!-- <PRE DIR=...> Base direction of preformatted block -->
+ <!-- <PRE LANG=...> Language of contents -->
+
+ <![ %HTML.Deprecated [
+
+ <!ENTITY % literal "CDATA"
+ -- historical, non-conforming parsing mode where
+ the only markup signal is the end tag
+ in full
+ -->
+
+ <!ELEMENT (XMP|LISTING) - - %literal>
+ <!ATTLIST XMP
+ %attrs;
+ %SDAFORM; "Lit"
+ %SDAPREF; "Example:&#RE;"
+ >
+ <!ATTLIST LISTING
+ %attrs;
+ %SDAFORM; "Lit"
+ %SDAPREF; "Listing:&#RE;"
+ >
+
+ <!-- <XMP> Example section -->
+ <!-- <LISTING> Computer listing -->
+
+ <!ELEMENT PLAINTEXT - O %literal>
+ <!-- <PLAINTEXT> Plain text passage -->
+
+ <!ATTLIST PLAINTEXT
+ %attrs;
+ %SDAFORM; "Lit"
+
+
+
+Yergeau, et. al. Standards Track [Page 28]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ >
+ ]]>
+
+
+ <!--========== Lists ==================-->
+
+ <!ELEMENT DL - - (DT | DD)+>
+ <!ATTLIST DL
+ %attrs;
+ COMPACT (COMPACT) #IMPLIED
+ %SDAFORM; "List"
+ %SDAPREF; "Definition List:"
+ >
+
+ <!ELEMENT DT - O (%text)*>
+ <!ATTLIST DT
+ %attrs;
+ %SDAFORM; "Term"
+ >
+
+ <!ELEMENT DD - O %flow>
+ <!ATTLIST DD
+ %attrs;
+ %SDAFORM; "LItem"
+ >
+
+ <!-- <DL> Definition list, or glossary -->
+ <!-- <DL COMPACT> Compact style list -->
+ <!-- <DT> Term in definition list -->
+ <!-- <DD> Definition of term -->
+
+ <!ELEMENT (OL|UL) - - (LI)+>
+ <!ATTLIST OL
+ %attrs;
+ %just;
+ COMPACT (COMPACT) #IMPLIED
+ %SDAFORM; "List"
+ >
+ <!ATTLIST UL
+ %attrs;
+ %just;
+ COMPACT (COMPACT) #IMPLIED
+ %SDAFORM; "List"
+ >
+ <!-- <UL> Unordered list -->
+ <!-- <UL COMPACT> Compact list style -->
+ <!-- <OL> Ordered, or numbered list -->
+ <!-- <OL COMPACT> Compact list style -->
+
+
+
+Yergeau, et. al. Standards Track [Page 29]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ <!ELEMENT (DIR|MENU) - - (LI)+ -(%block)>
+ <!ATTLIST DIR
+ %attrs;
+ %just;
+ COMPACT (COMPACT) #IMPLIED
+ %SDAFORM; "List"
+ %SDAPREF; "<LHead>Directory</LHead>"
+ >
+ <!ATTLIST MENU
+ %attrs;
+ %just;
+ COMPACT (COMPACT) #IMPLIED
+ %SDAFORM; "List"
+ %SDAPREF; "<LHead>Menu</LHead>"
+ >
+
+ <!-- <DIR> Directory list -->
+ <!-- <DIR COMPACT> Compact list style -->
+ <!-- <MENU> Menu list -->
+ <!-- <MENU COMPACT> Compact list style -->
+
+ <!ELEMENT LI - O %flow>
+ <!ATTLIST LI
+ %attrs;
+ %just;
+ %SDAFORM; "LItem"
+ >
+
+ <!-- <LI> List item -->
+
+ <!--========== Document Body ===================-->
+
+ <![ %HTML.Recommended [
+ <!ENTITY % body.content "(%heading|%block|HR|ADDRESS|IMG)*"
+ -- <h1>Heading</h1>
+ <p>Text ...
+ is preferred to
+ <h1>Heading</h1>
+ Text ...
+ -->
+ ]]>
+
+ <!ENTITY % body.content "(%heading | %text | %block |
+ HR | ADDRESS)*">
+
+ <!ELEMENT BODY O O %body.content>
+ <!ATTLIST BODY
+ %attrs;
+
+
+
+Yergeau, et. al. Standards Track [Page 30]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ >
+
+ <!-- <BODY> Document body -->
+ <!-- <BODY DIR=...> Base direction of whole body -->
+ <!-- <BODY LANG=...> Language of contents -->
+
+ <!ELEMENT BLOCKQUOTE - - %body.content>
+ <!ATTLIST BLOCKQUOTE
+ %attrs;
+ %just;
+ %SDAFORM; "BQ"
+ >
+
+ <!-- <BLOCKQUOTE> Quoted passage -->
+
+ <!ELEMENT ADDRESS - - (%text|P)*>
+ <!ATTLIST ADDRESS
+ %attrs;
+ %just;
+ %SDAFORM; "Lit"
+ %SDAPREF; "Address:&#RE;"
+ >
+
+ <!-- <ADDRESS> Address, signature, or byline -->
+
+
+ <!--======= Forms ====================-->
+
+ <![ %HTML.Forms [
+
+ <!ELEMENT FORM - - %body.content -(FORM) +(INPUT|SELECT|TEXTAREA)>
+ <!ATTLIST FORM
+ %attrs;
+ ACTION CDATA #IMPLIED
+ METHOD (%HTTP-Method) GET
+ ENCTYPE %Content-Type; "application/x-www-form-urlencoded"
+ %SDAPREF; "<Para>Form:</Para>"
+ %SDASUFF; "<Para>Form End.</Para>"
+ >
+
+ <!-- <FORM> Fill-out or data-entry form -->
+ <!-- <FORM ACTION="..."> Address for completed form -->
+ <!-- <FORM METHOD=...> Method of submitting form -->
+ <!-- <FORM ENCTYPE="..."> Representation of form data -->
+ <!-- <FORM DIR=...> Base direction of form -->
+ <!-- <FORM LANG=...> Language of contents -->
+
+ <!ENTITY % InputType "(TEXT | PASSWORD | CHECKBOX |
+
+
+
+Yergeau, et. al. Standards Track [Page 31]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ RADIO | SUBMIT | RESET |
+ IMAGE | HIDDEN | FILE )">
+ <!ELEMENT INPUT - O EMPTY>
+ <!ATTLIST INPUT
+ %attrs;
+ TYPE %InputType TEXT
+ NAME CDATA #IMPLIED
+ VALUE CDATA #IMPLIED
+ SRC CDATA #IMPLIED
+ CHECKED (CHECKED) #IMPLIED
+ SIZE CDATA #IMPLIED
+ MAXLENGTH NUMBER #IMPLIED
+ ALIGN (top|middle|bottom) #IMPLIED
+ ACCEPT CDATA #IMPLIED --list of content types --
+ ACCEPT-CHARSET CDATA #IMPLIED --list of charsets accepted --
+ %SDAPREF; "Input: "
+ >
+
+ <!-- <INPUT> Form input datum -->
+ <!-- <INPUT TYPE=...> Type of input interaction -->
+ <!-- <INPUT NAME=...> Name of form datum -->
+ <!-- <INPUT VALUE="..."> Default/initial/selected value -->
+ <!-- <INPUT SRC="..."> Address of image -->
+ <!-- <INPUT CHECKED> Initial state is "on" -->
+ <!-- <INPUT SIZE=...> Field size hint -->
+ <!-- <INPUT MAXLENGTH=...> Data length maximum -->
+ <!-- <INPUT ALIGN=...> Image alignment -->
+ <!-- <INPUT ACCEPT="..."> List of desired media types -->
+ <!-- <INPUT ACCEPT-CHARSET="..."> List of acceptable charsets -->
+
+ <!ELEMENT SELECT - - (OPTION+) -(INPUT|SELECT|TEXTAREA)>
+ <!ATTLIST SELECT
+ %attrs;
+ NAME CDATA #REQUIRED
+ SIZE NUMBER #IMPLIED
+ MULTIPLE (MULTIPLE) #IMPLIED
+ %SDAFORM; "List"
+ %SDAPREF;
+ "<LHead>Select #AttVal(Multiple)</LHead>"
+ >
+
+ <!-- <SELECT> Selection of option(s) -->
+ <!-- <SELECT NAME=...> Name of form datum -->
+ <!-- <SELECT SIZE=...> Options displayed at a time -->
+ <!-- <SELECT MULTIPLE> Multiple selections allowed -->
+
+ <!ELEMENT OPTION - O (#PCDATA)*>
+ <!ATTLIST OPTION
+
+
+
+Yergeau, et. al. Standards Track [Page 32]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ %attrs;
+ SELECTED (SELECTED) #IMPLIED
+ VALUE CDATA #IMPLIED
+ %SDAFORM; "LItem"
+ %SDAPREF;
+ "Option: #AttVal(Value) #AttVal(Selected)"
+ >
+
+ <!-- <OPTION> A selection option -->
+ <!-- <OPTION SELECTED> Initial state -->
+ <!-- <OPTION VALUE="..."> Form datum value for this option-->
+
+ <!ELEMENT TEXTAREA - - (#PCDATA)* -(INPUT|SELECT|TEXTAREA)>
+ <!ATTLIST TEXTAREA
+ %attrs;
+ NAME CDATA #REQUIRED
+ ROWS NUMBER #REQUIRED
+ COLS NUMBER #REQUIRED
+ ACCEPT-CHARSET CDATA #IMPLIED -- list of charsets accepted --
+ %SDAFORM; "Para"
+ %SDAPREF; "Input Text -- #AttVal(Name): "
+ >
+
+ <!-- <TEXTAREA> An area for text input -->
+ <!-- <TEXTAREA NAME=...> Name of form datum -->
+ <!-- <TEXTAREA ROWS=...> Height of area -->
+ <!-- <TEXTAREA COLS=...> Width of area -->
+
+ ]]>
+
+
+ <!--======= Document Head ======================-->
+
+ <![ %HTML.Recommended [
+ <!ENTITY % head.extra "">
+ ]]>
+ <!ENTITY % head.extra "& NEXTID?">
+
+ <!ENTITY % head.content "TITLE & ISINDEX? & BASE? %head.extra">
+
+ <!ELEMENT HEAD O O (%head.content) +(META|LINK)>
+ <!ATTLIST HEAD
+ %attrs; >
+
+ <!-- <HEAD> Document head -->
+
+ <!ELEMENT TITLE - - (#PCDATA)* -(META|LINK)>
+ <!ATTLIST TITLE
+
+
+
+Yergeau, et. al. Standards Track [Page 33]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ %attrs;
+ %SDAFORM; "Ti" >
+
+ <!-- <TITLE> Title of document -->
+
+ <!ELEMENT LINK - O EMPTY>
+ <!ATTLIST LINK
+ %attrs;
+ HREF CDATA #REQUIRED
+ %linkExtraAttributes;
+ %SDAPREF; "Linked to : #AttVal (TITLE) (URN) (HREF)>" >
+
+ <!-- <LINK> Link from this document -->
+ <!-- <LINK HREF="..."> Address of link destination -->
+ <!-- <LINK URN="..."> Lasting name of destination -->
+ <!-- <LINK REL=...> Relationship to destination -->
+ <!-- <LINK REV=...> Relationship of destination to this -->
+ <!-- <LINK TITLE="..."> Title of destination (advisory) -->
+ <!-- <LINK CHARSET="..."> Charset of destination (advisory) -->
+ <!-- <LINK METHODS="..."> Operations allowed (advisory) -->
+
+ <!ELEMENT ISINDEX - O EMPTY>
+ <!ATTLIST ISINDEX
+ %attrs;
+ %SDAPREF;
+ "<Para>[Document is indexed/searchable.]</Para>">
+
+ <!-- <ISINDEX> Document is a searchable index -->
+
+ <!ELEMENT BASE - O EMPTY>
+ <!ATTLIST BASE
+ HREF CDATA #REQUIRED >
+
+ <!-- <BASE> Base context document -->
+ <!-- <BASE HREF="..."> Address for this document -->
+
+ <!ELEMENT NEXTID - O EMPTY>
+ <!ATTLIST NEXTID
+ N CDATA #REQUIRED >
+
+ <!-- <NEXTID> Next ID to use for link name -->
+ <!-- <NEXTID N=...> Next ID to use for link name -->
+
+ <!ELEMENT META - O EMPTY>
+ <!ATTLIST META
+ HTTP-EQUIV NAME #IMPLIED
+ NAME NAME #IMPLIED
+ CONTENT CDATA #REQUIRED >
+
+
+
+Yergeau, et. al. Standards Track [Page 34]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ <!-- <META> Generic Meta-information -->
+ <!-- <META HTTP-EQUIV=...> HTTP response header name -->
+ <!-- <META NAME=...> Meta-information name -->
+ <!-- <META CONTENT="..."> Associated information -->
+
+ <!--======= Document Structure =================-->
+
+ <![ %HTML.Deprecated [
+ <!ENTITY % html.content "HEAD, BODY, PLAINTEXT?">
+ ]]>
+ <!ENTITY % html.content "HEAD, BODY">
+
+ <!ELEMENT HTML O O (%html.content)>
+ <!ENTITY % version.attr "VERSION CDATA #FIXED '%HTML.Version;'">
+
+ <!ATTLIST HTML
+ %attrs;
+ %version.attr;
+ %SDAFORM; "Book"
+ >
+
+ <!-- <HTML> HTML Document -->
+
+7.2. SGML Declaration for HTML
+
+ <!SGML "ISO 8879:1986"
+ --
+ SGML Declaration for HyperText Markup Language version 2.x
+ (HTML 2.x = HTML 2.0 + i18n).
+
+ --
+
+ CHARSET
+ BASESET "ISO Registration Number 177//CHARSET
+ ISO/IEC 10646-1:1993 UCS-4 with
+ implementation level 3//ESC 2/5 2/15 4/6"
+ DESCSET 0 9 UNUSED
+ 9 2 9
+ 11 2 UNUSED
+ 13 1 13
+ 14 18 UNUSED
+ 32 95 32
+ 127 1 UNUSED
+ 128 32 UNUSED
+ 160 2147483486 160
+ --
+ In ISO 10646, the positions with hexadecimal
+ values 0000D800 - 0000DFFF, used in the UTF-16
+
+
+
+Yergeau, et. al. Standards Track [Page 35]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ encoding of UCS-4, are reserved, as well as the last
+ two code values in each plane of UCS-4, i.e. all
+ values of the hexadecimal form xxxxFFFE or xxxxFFFF.
+ These code values or the corresponding numeric
+ character references must not be included when
+ generating a new HTML document, and they should be
+ ignored if encountered when processing a HTML
+ document.
+ --
+
+ CAPACITY SGMLREF
+ TOTALCAP 150000
+ GRPCAP 150000
+ ENTCAP 150000
+
+ SCOPE DOCUMENT
+
+ SYNTAX
+ SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
+ 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127
+
+ BASESET "ISO 646IRV:1991//CHARSET
+ International Reference Version
+ (IRV)//ESC 2/8 4/2"
+ DESCSET 0 128 0
+
+ FUNCTION
+ RE 13
+ RS 10
+ SPACE 32
+ TAB SEPCHAR 9
+
+ NAMING LCNMSTRT ""
+ UCNMSTRT ""
+ LCNMCHAR ".-"
+ UCNMCHAR ".-"
+ NAMECASE GENERAL YES
+ ENTITY NO
+ DELIM GENERAL SGMLREF
+ SHORTREF SGMLREF
+ NAMES SGMLREF
+ QUANTITY SGMLREF
+ ATTSPLEN 2100
+ LITLEN 1024
+ NAMELEN 72 -- somewhat arbitrary; taken from
+ internet line length conventions --
+ PILEN 1024
+ TAGLVL 100
+
+
+
+Yergeau, et. al. Standards Track [Page 36]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ TAGLEN 2100
+ GRPGTCNT 150
+ GRPCNT 64
+
+ FEATURES
+ MINIMIZE
+ DATATAG NO
+ OMITTAG YES
+ RANK NO
+ SHORTTAG YES
+ LINK
+ SIMPLE NO
+ IMPLICIT NO
+ EXPLICIT NO
+ OTHER
+ CONCUR NO
+ SUBDOC NO
+ FORMAL YES
+ APPINFO "SDA" -- conforming SGML Document Access application
+ --
+ >
+
+7.3. ISO Latin 1 entity set
+
+ The following public text lists each of the characters specified in
+ the Added Latin 1 entity set, along with its name, syntax for use,
+ and description. This list is derived from ISO Standard
+ 8879:1986//ENTITIES Added Latin 1//EN. HTML includes the entire
+ entity set, and adds entities for all missing characters in the right
+ part of ISO-8859-1.
+
+ <!-- (C) International Organization for Standardization 1986
+ Permission to copy in any form is granted for use with
+ conforming SGML systems and applications as defined in
+ ISO 8879, provided this notice is included in all copies.
+ -->
+ <!-- Character entity set. Typical invocation:
+ <!ENTITY % ISOlat1 PUBLIC
+ "ISO 8879-1986//ENTITIES Added Latin 1//EN//HTML">
+ %ISOlat1;
+ -->
+ <!ENTITY nbsp CDATA "&#160;" -- no-break space -->
+ <!ENTITY iexcl CDATA "&#161;" -- inverted exclamation mark -->
+ <!ENTITY cent CDATA "&#162;" -- cent sign -->
+ <!ENTITY pound CDATA "&#163;" -- pound sterling sign -->
+ <!ENTITY curren CDATA "&#164;" -- general currency sign -->
+ <!ENTITY yen CDATA "&#165;" -- yen sign -->
+ <!ENTITY brvbar CDATA "&#166;" -- broken (vertical) bar -->
+
+
+
+Yergeau, et. al. Standards Track [Page 37]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ <!ENTITY sect CDATA "&#167;" -- section sign -->
+ <!ENTITY uml CDATA "&#168;" -- umlaut (dieresis) -->
+ <!ENTITY copy CDATA "&#169;" -- copyright sign -->
+ <!ENTITY ordf CDATA "&#170;" -- ordinal indicator, feminine -->
+ <!ENTITY laquo CDATA "&#171;" -- angle quotation mark, left -->
+ <!ENTITY not CDATA "&#172;" -- not sign -->
+ <!ENTITY shy CDATA "&#173;" -- soft hyphen -->
+ <!ENTITY reg CDATA "&#174;" -- registered sign -->
+ <!ENTITY macr CDATA "&#175;" -- macron -->
+ <!ENTITY deg CDATA "&#176;" -- degree sign -->
+ <!ENTITY plusmn CDATA "&#177;" -- plus-or-minus sign -->
+ <!ENTITY sup2 CDATA "&#178;" -- superscript two -->
+ <!ENTITY sup3 CDATA "&#179;" -- superscript three -->
+ <!ENTITY acute CDATA "&#180;" -- acute accent -->
+ <!ENTITY micro CDATA "&#181;" -- micro sign -->
+ <!ENTITY para CDATA "&#182;" -- pilcrow (paragraph sign) -->
+ <!ENTITY middot CDATA "&#183;" -- middle dot -->
+ <!ENTITY cedil CDATA "&#184;" -- cedilla -->
+ <!ENTITY sup1 CDATA "&#185;" -- superscript one -->
+ <!ENTITY ordm CDATA "&#186;" -- ordinal indicator, masculine -->
+ <!ENTITY raquo CDATA "&#187;" -- angle quotation mark, right -->
+ <!ENTITY frac14 CDATA "&#188;" -- fraction one-quarter -->
+ <!ENTITY frac12 CDATA "&#189;" -- fraction one-half -->
+ <!ENTITY frac34 CDATA "&#190;" -- fraction three-quarters -->
+ <!ENTITY iquest CDATA "&#191;" -- inverted question mark -->
+ <!ENTITY Agrave CDATA "&#192;" -- capital A, grave accent -->
+ <!ENTITY Aacute CDATA "&#193;" -- capital A, acute accent -->
+ <!ENTITY Acirc CDATA "&#194;" -- capital A, circumflex accent -->
+ <!ENTITY Atilde CDATA "&#195;" -- capital A, tilde -->
+ <!ENTITY Auml CDATA "&#196;" -- capital A, dieresis or umlaut -->
+ <!ENTITY Aring CDATA "&#197;" -- capital A, ring -->
+ <!ENTITY AElig CDATA "&#198;" -- capital AE diphthong (ligature) -->
+ <!ENTITY Ccedil CDATA "&#199;" -- capital C, cedilla -->
+ <!ENTITY Egrave CDATA "&#200;" -- capital E, grave accent -->
+ <!ENTITY Eacute CDATA "&#201;" -- capital E, acute accent -->
+ <!ENTITY Ecirc CDATA "&#202;" -- capital E, circumflex accent -->
+ <!ENTITY Euml CDATA "&#203;" -- capital E, dieresis or umlaut -->
+ <!ENTITY Igrave CDATA "&#204;" -- capital I, grave accent -->
+ <!ENTITY Iacute CDATA "&#205;" -- capital I, acute accent -->
+ <!ENTITY Icirc CDATA "&#206;" -- capital I, circumflex accent -->
+ <!ENTITY Iuml CDATA "&#207;" -- capital I, dieresis or umlaut -->
+ <!ENTITY ETH CDATA "&#208;" -- capital Eth, Icelandic -->
+ <!ENTITY Ntilde CDATA "&#209;" -- capital N, tilde -->
+ <!ENTITY Ograve CDATA "&#210;" -- capital O, grave accent -->
+ <!ENTITY Oacute CDATA "&#211;" -- capital O, acute accent -->
+ <!ENTITY Ocirc CDATA "&#212;" -- capital O, circumflex accent -->
+ <!ENTITY Otilde CDATA "&#213;" -- capital O, tilde -->
+ <!ENTITY Ouml CDATA "&#214;" -- capital O, dieresis or umlaut -->
+
+
+
+Yergeau, et. al. Standards Track [Page 38]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ <!ENTITY times CDATA "&#215;" -- multiply sign -->
+ <!ENTITY Oslash CDATA "&#216;" -- capital O, slash -->
+ <!ENTITY Ugrave CDATA "&#217;" -- capital U, grave accent -->
+ <!ENTITY Uacute CDATA "&#218;" -- capital U, acute accent -->
+ <!ENTITY Ucirc CDATA "&#219;" -- capital U, circumflex accent -->
+ <!ENTITY Uuml CDATA "&#220;" -- capital U, dieresis or umlaut -->
+ <!ENTITY Yacute CDATA "&#221;" -- capital Y, acute accent -->
+ <!ENTITY THORN CDATA "&#222;" -- capital Thorn, Icelandic -->
+ <!ENTITY szlig CDATA "&#223;" -- small sharp s, German (sz ligature) -->
+ <!ENTITY agrave CDATA "&#224;" -- small a, grave accent -->
+ <!ENTITY aacute CDATA "&#225;" -- small a, acute accent -->
+ <!ENTITY acirc CDATA "&#226;" -- small a, circumflex accent -->
+ <!ENTITY atilde CDATA "&#227;" -- small a, tilde -->
+ <!ENTITY auml CDATA "&#228;" -- small a, dieresis or umlaut -->
+ <!ENTITY aring CDATA "&#229;" -- small a, ring -->
+ <!ENTITY aelig CDATA "&#230;" -- small ae diphthong (ligature) -->
+ <!ENTITY ccedil CDATA "&#231;" -- small c, cedilla -->
+ <!ENTITY egrave CDATA "&#232;" -- small e, grave accent -->
+ <!ENTITY eacute CDATA "&#233;" -- small e, acute accent -->
+ <!ENTITY ecirc CDATA "&#234;" -- small e, circumflex accent -->
+ <!ENTITY euml CDATA "&#235;" -- small e, dieresis or umlaut -->
+ <!ENTITY igrave CDATA "&#236;" -- small i, grave accent -->
+ <!ENTITY iacute CDATA "&#237;" -- small i, acute accent -->
+ <!ENTITY icirc CDATA "&#238;" -- small i, circumflex accent -->
+ <!ENTITY iuml CDATA "&#239;" -- small i, dieresis or umlaut -->
+ <!ENTITY eth CDATA "&#240;" -- small eth, Icelandic -->
+ <!ENTITY ntilde CDATA "&#241;" -- small n, tilde -->
+ <!ENTITY ograve CDATA "&#242;" -- small o, grave accent -->
+ <!ENTITY oacute CDATA "&#243;" -- small o, acute accent -->
+ <!ENTITY ocirc CDATA "&#244;" -- small o, circumflex accent -->
+ <!ENTITY otilde CDATA "&#245;" -- small o, tilde -->
+ <!ENTITY ouml CDATA "&#246;" -- small o, dieresis or umlaut -->
+ <!ENTITY divide CDATA "&#247;" -- divide sign -->
+ <!ENTITY oslash CDATA "&#248;" -- small o, slash -->
+ <!ENTITY ugrave CDATA "&#249;" -- small u, grave accent -->
+ <!ENTITY uacute CDATA "&#250;" -- small u, acute accent -->
+ <!ENTITY ucirc CDATA "&#251;" -- small u, circumflex accent -->
+ <!ENTITY uuml CDATA "&#252;" -- small u, dieresis or umlaut -->
+ <!ENTITY yacute CDATA "&#253;" -- small y, acute accent -->
+ <!ENTITY thorn CDATA "&#254;" -- small thorn, Icelandic -->
+ <!ENTITY yuml CDATA "&#255;" -- small y, dieresis or umlaut -->
+
+
+
+
+
+
+
+
+
+
+Yergeau, et. al. Standards Track [Page 39]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+8. Security Considerations
+
+ Anchors, embedded images, and all other elements which contain URIs
+ as parameters may cause the URI to be dereferenced in response to
+ user input. In this case, the security considerations of [RFC1738]
+ apply.
+
+ The widely deployed methods for submitting form requests -- HTTP and
+ SMTP -- provide little assurance of confidentiality. Information
+ providers who request sensitive information via forms -- especially
+ by way of the `PASSWORD' type input field (see section 8.1.2 in
+ [RFC1866]) -- should be aware and make their users aware of the lack
+ of confidentiality.
+
+Bibliography
+
+ [BRYAN88] M. Bryan, "SGML -- An Author's Guide to the Standard
+ Generalized Markup Language", Addison-Wesley, Reading,
+ 1988.
+
+ [ERCS] Extended Reference Concrete Syntax for SGML.
+ <http://www.sgmlopen.org/sgml/docs/ercs/ercs-
+ home.html>
+
+ [GOLD90] C. F. Goldfarb, "The SGML Handbook", Y. Rubinsky, Ed.,
+ Oxford University Press, 1990.
+
+ [HTTP-1.1] Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
+ and T. Berners-Lee, "Hypertext Transfer Protocol --
+ HTTP/1.1", RFC 2068, January 1997.
+
+ [ISO-639] ISO 639:1988. International standard -- Code for the
+ representation of the names of languages. Technical
+ content in <http://www.sil.org/sgml/iso639a.html>
+
+ [ISO-8859] ISO 8859. International standard -- Information pro-
+ cessing -- 8-bit single-byte coded graphic character
+ sets -- Part 1: Latin alphabet No. 1 (1987) -- Part 2:
+ Latin alphabet No. 2 (1987) -- Part 3: Latin alphabet
+ No. 3 (1988) -- Part 4: Latin alphabet No. 4 (1988) --
+ Part 5: Latin/Cyrillic alphabet (1988) -- Part 6:
+ Latin/Arabic alphabet (1987) -- Part : Latin/Greek
+ alphabet (1987) -- Part 8: Latin/Hebrew alphabet
+ (1988) -- Part 9: Latin alphabet No. 5 (1989) -- Part
+ 10: Latin alphabet No. 6 (1992)
+
+
+
+
+
+
+Yergeau, et. al. Standards Track [Page 40]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ [ISO-8879] ISO 8879:1986. International standard -- Information
+ processing -- Text and office systems -- Standard gen-
+ eralized markup language (SGML).
+
+ [ISO-10646] ISO/IEC 10646-1:1993. International standard -- Infor-
+ mation technology -- Universal multiple-octet coded
+ character Sset (UCS) -- Part 1: Architecture and basic
+ multilingual plane.
+
+ [NICOL] G.T. Nicol, "The Multilingual World Wide Web",
+ Electronic Book Technologies, 1995,
+ <http://www.ebt.com/docs/multling.html>
+
+ [NICOL2] G.T. Nicol, "MIME Header Supplemented File Type", Work
+ in Progress, EBT, October 1995.
+
+ [RFC1345] Simonsen, K., "Character Mnemonics & Character Sets",
+ RFC 1345, Rationel Almen Planlaegning, June 1992.
+
+ [RFC1468] Murai, J., Crispin M., and E. van der Poel,
+ "Japanese Character Encoding for Internet Messages",
+ RFC 1468, Keio University, Panda Programming, June
+ 1993.
+
+ [RFC2045] Freed, N., and N. Borenstein, "Multipurpose Internet
+ Mail Extensions (MIME) Part One: Format of Internet
+ Message Bodies", RFC 2045, Innosoft, First Virtual,
+ November 1996.
+
+ [RFC1641] Goldsmith, D., and M.Davis, "Using Unicode with MIME",
+ RFC 1641, Taligent inc., July 1994.
+
+ [RFC1642] Goldsmith, D., and M. Davis, "UTF-7: A Mail-safe
+ Transformation Format of Unicode", RFC 1642, Taligent,
+ Inc., July 1994.
+
+ [RFC1738] Berners-Lee, T., Masinter, L., and M. McCahill,
+ "Uniform Resource Locators (URL)", RFC 1738, CERN,
+ Xerox PARC, University of Minnesota, October 1994.
+
+ [RFC1766] Alverstrand, H., "Tags for the Identification of
+ Languages", RFC 1766, UNINETT, March 1995.
+
+ [RFC1866] Berners-Lee, T., and D. Connolly, "Hypertext Markup
+ Language - 2.0", RFC 1866, MIT/W3C, November 1995.
+
+ [RFC1867] Nebel, E., and L. Masinter, "Form-based File Upload
+ in HTML", RFC 1867, Xerox Corporation, November 1995.
+
+
+
+Yergeau, et. al. Standards Track [Page 41]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+ [RFC1942] Raggett, D., "HTML Tables", RFC 1942, W3C, May 1996.
+
+ [RFC2068] Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
+ and T. Berners-Lee, "Hypertext Transfer Protocol --
+ HTTP/1.1", RFC 2068, January 1997.
+
+ [SQ91] SoftQuad, "The SGML Primer", 3rd ed., SoftQuad Inc.,
+ 1991.
+
+ [TAKADA] Toshihiro Takada, "Multilingual Information Exchange
+ through the World-Wide Web", Computer Networks and
+ ISDN Systems, Vol. 27, No. 2, Nov. 1994 , p. 235-241.
+
+ [TEI] TEI Guidelines for Electronic Text Encoding and Inter-
+ change. <http://etext.virgina.edu/TEI.html>
+
+ [UNICODE] The Unicode Consortium, "The Unicode Standard --
+ Worldwide Character Encoding -- Version 1.0", Addison-
+ Wesley, Volume 1, 1991, Volume 2, 1992, and Technical
+ Report #4, 1993. The BIDI algorithm is in appendix A
+ of volume 1, with corrections in appendix D of volume
+ 2.
+
+ [UTF-8] ISO/IEC 10646-1:1993 AMENDMENT 2 (1996). UCS Transfor-
+ mation Format 8 (UTF-8).
+
+ [VANH90] E. van Hervijnen, "Practical SGML", Kluwer Academicq
+ Publishers Group, Norwell and Dordrecht, 1990.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Yergeau, et. al. Standards Track [Page 42]
+
+RFC 2070 HTML Internationalization January 1997
+
+
+Authors' Addresses
+
+ Frangois Yergeau
+ Alis Technologies
+ 100, boul. Alexis-Nihon, bureau 600
+ Montrial QC H4M 2P2
+ Canada
+
+ Tel: +1 (514) 747-2547
+ Fax: +1 (514) 747-2561
+ EMail: fyergeau@alis.com
+
+
+ Gavin Thomas Nicol
+ Electronic Book Technologies, Japan
+ 1-29-9 Tsurumaki,
+ Setagaya-ku,
+ Tokyo
+ Japan
+
+ Tel: +81-3-3230-8161
+ Fax: +81-3-3230-8163
+ EMail: gtn@ebt.com, gtn@twics.co.jp
+
+
+ Glenn Adams
+ Spyglass
+ 118 Magazine Street
+ Cambridge, MA 02139
+ U.S.A.
+
+ Tel: +1 (617) 864-5524
+ Fax: +1 (617) 864-4965
+ EMail: glenn@spyglass.com
+
+
+ Martin J. Duerst
+ Multimedia-Laboratory
+ Department of Computer Science
+ University of Zurich
+ Winterthurerstrasse 190
+ CH-8057 Zurich
+ Switzerland
+
+ Tel: +41 1 257 43 16
+ Fax: +41 1 363 00 35
+ EMail: mduerst@ifi.unizh.ch
+
+
+
+
+Yergeau, et. al. Standards Track [Page 43]
+