Diffstat (limited to 'doc/rfc/rfc2640.txt')
-rw-r--r-- doc/rfc/rfc2640.txt 1515
1 files changed, 1515 insertions, 0 deletions
diff --git a/doc/rfc/rfc2640.txt b/doc/rfc/rfc2640.txt
new file mode 100644
index 0000000..73ff879
--- /dev/null
+++ b/doc/rfc/rfc2640.txt
@@ -0,0 +1,1515 @@
+
+
+
+
+
+
+Network Working Group B. Curtin
+Request for Comments: 2640 Defense Information Systems Agency
+Updates: 959 July 1999
+Category: Proposed Standard
+
+
+ Internationalization of the File Transfer Protocol
+
+Status of this Memo
+
+ This document specifies an Internet standards track protocol for the
+ Internet community, and requests discussion and suggestions for
+ improvements. Please refer to the current edition of the "Internet
+ Official Protocol Standards" (STD 1) for the standardization state
+ and status of this protocol. Distribution of this memo is unlimited.
+
+Copyright Notice
+
+ Copyright (C) The Internet Society (1999). All Rights Reserved.
+
+Abstract
+
+ The File Transfer Protocol, as defined in RFC 959 [RFC959] and RFC
+ 1123 Section 4 [RFC1123], is one of the oldest and most widely used
+ protocols on the Internet. The protocol's primary character set, 7
+ bit ASCII, has served the protocol well through the early growth
+ years of the Internet. However, as the Internet becomes more global,
+ there is a need to support character sets beyond 7 bit ASCII.
+
+ This document addresses the internationalization (I18n) of FTP, which
+ includes supporting the multiple character sets and languages found
+ throughout the Internet community. This is achieved by extending the
+ FTP specification and giving recommendations for proper
+ internationalization support.
+
+Table of Contents
+
+ ABSTRACT.......................................................1
+ 1 INTRODUCTION.................................................2
+ 1.1 Requirements Terminology..................................2
+ 2 INTERNATIONALIZATION.........................................3
+ 2.1 International Character Set...............................3
+ 2.2 Transfer Encoding Set.....................................4
+ 3 PATHNAMES....................................................5
+ 3.1 General compliance........................................5
+ 3.2 Servers compliance........................................6
+ 3.3 Clients compliance........................................7
+ 4 LANGUAGE SUPPORT.............................................7
+
+
+
+Curtin Proposed Standard [Page 1]
+
+RFC 2640 FTP Internalization July 1999
+
+
+ 4.1 The LANG command..........................................8
+ 4.2 Syntax of the LANG command................................9
+ 4.3 Feat response for LANG command...........................11
+ 4.3.1 Feat examples.........................................11
+ 5 SECURITY CONSIDERATIONS.....................................12
+ 6 ACKNOWLEDGMENTS.............................................12
+ 7 GLOSSARY....................................................13
+ 8 BIBLIOGRAPHY................................................13
+ 9 AUTHOR'S ADDRESS............................................15
+ ANNEX A - IMPLEMENTATION CONSIDERATIONS.......................16
+ A.1 General Considerations...................................16
+ A.2 Transition Considerations................................18
+ ANNEX B - SAMPLE CODE AND EXAMPLES............................19
+ B.1 Valid UTF-8 check........................................19
+ B.2 Conversions..............................................20
+ B.2.1 Conversion from Local Character Set to UTF-8..........20
+ B.2.2 Conversion from UTF-8 to Local Character Set..........23
+ B.2.3 ISO/IEC 8859-8 Example................................25
+ B.2.4 Vendor Codepage Example...............................25
+ B.3 Pseudo Code for Translating Servers......................26
+ Full Copyright Statement......................................27
+
+1 Introduction
+
+ As the Internet grows throughout the world the requirement to support
+ character sets outside of the ASCII [ASCII] / Latin-1 [ISO-8859]
+ character set becomes ever more urgent. For FTP, because of the
+ large installed base, it is paramount that this is done without
+ breaking existing clients and servers. This document addresses this
+ need. In doing so it defines a solution which will still allow the
+ installed base to interoperate with new clients and servers.
+
+ This document enhances the capabilities of the File Transfer Protocol
+ by removing the 7-bit restrictions on pathnames used in client
+ commands and server responses, RECOMMENDs the use of a Universal
+ Character Set (UCS) ISO/IEC 10646 [ISO-10646], RECOMMENDs a UCS
+ transformation format (UTF) UTF-8 [UTF-8], and defines a new command
+ for language negotiation.
+
+ The recommendations made in this document are consistent with the
+ recommendations expressed by the IETF policy related to character
+ sets and languages as defined in RFC 2277 [RFC2277].
+
+1.1. Requirements Terminology
+
+ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+ "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
+ document are to be interpreted as described in BCP 14 [BCP14].
+
+
+
+Curtin Proposed Standard [Page 2]
+
+RFC 2640 FTP Internalization July 1999
+
+
+2 Internationalization
+
+ The File Transfer Protocol was developed when the predominant
+ character sets were 7 bit ASCII and 8 bit EBCDIC. Today these
+ character sets cannot support the wide range of characters needed by
+ multinational systems. Given that there are a number of character
+ sets in current use that provide more characters than 7-bit ASCII, it
+ makes sense to decide on a convenient way to represent the union of
+ those possibilities. To work globally either requires support of a
+ number of character sets and to be able to convert between them, or
+ the use of a single preferred character set. To assure global
+ interoperability this document RECOMMENDS the latter approach and
+ defines a single character set, in addition to NVT ASCII and EBCDIC,
+ which is understandable by all systems. For FTP this character set
+ SHALL be ISO/IEC 10646:1993. For support of global compatibility it
+ is STRONGLY RECOMMENDED that clients and servers use UTF-8 encoding
+ when exchanging pathnames. Clients and servers are, however, under
+ no obligation to perform any conversion on the contents of a file for
+ operations such as STOR or RETR.
+
+ The character set used to store files SHALL remain a local decision
+ and MAY depend on the capability of local operating systems. Prior to
+ the exchange of pathnames they SHOULD be converted into an ISO/IEC
+ 10646 format and UTF-8 encoded. This approach, while allowing
+ international exchange of pathnames, will still allow backward
+ compatibility with older systems because the code set positions for
+ ASCII characters are identical to the one byte sequence in UTF-8.
+
+ Sections 2.1 and 2.2 give a brief description of the international
+ character set and transfer encoding RECOMMENDED by this document. A
+ more thorough description of UTF-8, ISO/IEC 10646, and UNICODE
+ [UNICODE], beyond that given in this document, can be found in RFC
+ 2279 [RFC2279].
+
+2.1 International Character Set
+
+ The character set defined for international support of FTP SHALL be
+ the Universal Character Set as defined in ISO 10646:1993 as amended.
+ This standard incorporates the character sets of many existing
+ international, national, and corporate standards. ISO/IEC 10646
+ defines two alternate forms of encoding, UCS-4 and UCS-2. UCS-4 is a
+ four byte (31 bit) encoding containing 2**31 code positions divided
+ into 128 groups of 256 planes. Each plane consists of 256 rows of 256
+ cells. UCS-2 is a 2 byte (16 bit) character set consisting of plane
+ zero or the Basic Multilingual Plane (BMP). Currently, no codesets
+ have been defined outside of the 2 byte BMP.
+
+
+
+
+
+Curtin Proposed Standard [Page 3]
+
+RFC 2640 FTP Internalization July 1999
+
+
+ The Unicode standard version 2.0 [UNICODE] is consistent with the
+ UCS-2 subset of ISO/IEC 10646. The Unicode standard version 2.0
+ includes the repertoire of IS 10646 characters, amendments 1-7 of IS
+ 10646, and editorial and technical corrigenda.
+
+2.2 Transfer Encoding
+
+ UCS Transformation Format 8 (UTF-8), in the past referred to as UTF-2
+ or UTF-FSS, SHALL be used as a transfer encoding to transmit the
+ international character set. UTF-8 is a file safe encoding which
+ avoids the use of byte values that have special significance during
+ the parsing of pathname character strings. UTF-8 is an 8 bit encoding
+ of the characters in the UCS. Some of UTF-8's benefits are that it is
+ compatible with 7 bit ASCII, so it doesn't affect programs that give
+ special meanings to various ASCII characters; it is immune to
+ synchronization errors; its encoding rules allow for easy
+ identification; and it has enough space to support a large number of
+ character sets.
+
+ UTF-8 encoding represents each UCS character as a sequence of 1 to 6
+ bytes in length. For all sequences of one byte the most significant
+ bit is ZERO. For all sequences of more than one byte the number of
+ ONE bits in the first byte, counted from the most significant bit
+ position and terminated by a ZERO bit, indicates the number of bytes
+ in the UTF-8 sequence. For example, the first byte of a 3 byte UTF-8
+ sequence would have 1110 as its most significant bits. Each
+ additional byte (continuing byte) in the UTF-8 sequence contains a
+ ONE bit followed by a ZERO bit as its most significant bits. The
+ remaining free bit positions in the first and continuing bytes are
+ used to identify characters in the UCS. The relationship between UCS and
+ UTF-8 is demonstrated in the following table:
+
+   UCS-4 range (hex)      UTF-8 byte sequence (binary)
+   00000000 - 0000007F    0xxxxxxx
+   00000080 - 000007FF    110xxxxx 10xxxxxx
+   00000800 - 0000FFFF    1110xxxx 10xxxxxx 10xxxxxx
+   00010000 - 001FFFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+   00200000 - 03FFFFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx
+                          10xxxxxx
+   04000000 - 7FFFFFFF    1111110x 10xxxxxx 10xxxxxx 10xxxxxx
+                          10xxxxxx 10xxxxxx
+
+ A beneficial property of UTF-8 is that its single byte sequence is
+ consistent with the ASCII character set. This feature will allow a
+ transition where old ASCII-only clients can still interoperate with
+ new servers that support the UTF-8 encoding.
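
The table above can be read as an encoding procedure. The following is a minimal sketch in C, not taken from the RFC; the function name ucs4_to_utf8 is an assumption made for illustration. It covers the full 1 to 6 byte range described in this section and does not reject surrogate values. Later revisions of UTF-8 (RFC 3629) limit sequences to 4 bytes, so a production encoder may need a tighter range check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one UCS-4 code point (0 .. 0x7FFFFFFF) following the table above.
 * Writes up to 6 bytes to out and returns the byte count, or 0 if the
 * value lies outside the 31-bit UCS-4 range. */
size_t ucs4_to_utf8(uint32_t cp, unsigned char out[6])
{
    size_t len, i;
    unsigned char lead;   /* length marker bits of the first byte */

    assert(out != NULL);
    if (cp <= 0x7F) {
        out[0] = (unsigned char)cp;      /* 0xxxxxxx: identical to ASCII */
        return 1;
    }
    else if (cp <= 0x7FF)      { len = 2; lead = 0xC0; }
    else if (cp <= 0xFFFF)     { len = 3; lead = 0xE0; }
    else if (cp <= 0x1FFFFF)   { len = 4; lead = 0xF0; }
    else if (cp <= 0x3FFFFFF)  { len = 5; lead = 0xF8; }
    else if (cp <= 0x7FFFFFFF) { len = 6; lead = 0xFC; }
    else return 0;

    /* Fill continuing bytes from the end: 10xxxxxx, six payload bits each. */
    for (i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    out[0] = (unsigned char)(lead | cp); /* remaining high bits */
    return len;
}
```

For example, U+05D0 (HEBREW LETTER ALEF) falls in the second row of the table and encodes as the two bytes D7 90.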
+
+
+
+
+
+Curtin Proposed Standard [Page 4]
+
+RFC 2640 FTP Internalization July 1999
+
+
+ Another feature is that the encoding rules make it very unlikely that
+ a character sequence from a different character set will be mistaken
+ for a UTF-8 encoded character sequence. Clients and servers can use a
+ simple routine to determine if the character set being exchanged is
+ valid UTF-8. Section B.1 shows a code example of this check.
+
+3 Pathnames
+
+3.1 General compliance
+
+ - The 7-bit restriction for pathnames exchanged is dropped.
+
+ - Many operating systems allow the use of spaces <SP>, carriage return
+ <CR>, and line feed <LF> characters as part of the pathname. The
+ exchange of pathnames with these special command characters will
+ cause the pathnames to be parsed improperly. This is because FTP
+ commands associated with pathnames have the form:
+
+ COMMAND <SP> <pathname> <CRLF>.
+
+ To allow the exchange of pathnames containing these characters, the
+ definition of pathname is changed from
+
+ <pathname> ::= <string> ; in BNF format
+ to
+ pathname = 1*(%x01..%xFF) ; in ABNF format [ABNF].
+
+ To avoid mistaking these characters within pathnames as special
+ command characters the following rules will apply:
+
+ There MUST be only one <SP> between an FTP command and the pathname.
+ Implementations MUST assume <SP> characters following the initial
+ <SP> as part of the pathname. For example the pathname in STOR
+ <SP><SP><SP>foo.bar<CRLF> is <SP><SP>foo.bar.
+
+ Current implementations, which may allow multiple <SP> characters as
+ separators between the command and pathname, MUST assure that they
+ comply with this single <SP> convention. Note: Implementations which
+ treat 3 character commands (e.g. CWD, MKD, etc.) as a fixed 4
+ character command by padding the command with a trailing <SP> do
+ not comply with this specification.
+
+ When a <CR> character is encountered as part of a pathname it MUST be
+ padded with a <NUL> character prior to sending the command. On
+ receipt of a pathname containing a <CR><NUL> sequence the <NUL>
+ character MUST be stripped away. This approach is described in the
+ Telnet protocol [RFC854] on pages 11 and 12. For example, to store a
+ pathname foo<CR><LF>boo.bar the pathname would become
+
+
+
+Curtin Proposed Standard [Page 5]
+
+RFC 2640 FTP Internalization July 1999
+
+
+ foo<CR><NUL><LF>boo.bar prior to sending the command STOR
+ <SP>foo<CR><NUL><LF>boo.bar<CRLF>. Upon receipt of the altered
+ pathname the <NUL> character following the <CR> would be stripped
+ away to form the original pathname.
+
+ - Conforming clients and servers MUST support UTF-8 for the transfer
+ and receipt of pathnames. Clients and servers MAY in addition give
+ users a choice of specifying interpretation of pathnames in another
+ encoding. Note that configuring clients and servers to use
+ character sets / encoding other than UTF-8 is outside of the scope
+ of this document. While it is recognized that in certain
+ operational scenarios this may be desirable, this is left as a
+ quality of implementation and operational issue.
+
+ - Pathnames are sequences of bytes. The encoding of names that are
+ valid UTF-8 sequences is assumed to be UTF-8. The character set of
+ other names is undefined. Clients and servers, unless otherwise
+ configured to support a specific native character set, MUST check
+ for a valid UTF-8 byte sequence to determine if the pathname being
+ presented is UTF-8.
+
+ - To avoid data loss, clients and servers SHOULD use the UTF-8
+ encoded pathnames when unable to convert them to a usable code set.
+
+ - There may be cases when the code set / encoding presented to the
+ server or client cannot be determined. In such cases the raw bytes
+ SHOULD be used.
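
The <CR><NUL> rule in the list above can be sketched in C as a pair of helpers. This is an illustration under stated assumptions, not code from the RFC: the names pad_cr and strip_cr_nul are hypothetical, and lengths are passed explicitly because the padded form contains <NUL> bytes and so cannot be handled as a C string.

```c
#include <assert.h>
#include <stddef.h>

/* Sender side: copy a pathname of len bytes into out, inserting <NUL>
 * after every <CR>. out must hold at least 2 * len bytes. Returns the
 * number of bytes written. */
size_t pad_cr(const char *path, size_t len, char *out)
{
    size_t i, n = 0;

    assert(path != NULL && out != NULL);
    for (i = 0; i < len; i++) {
        out[n++] = path[i];
        if (path[i] == '\r')
            out[n++] = '\0';
    }
    return n;
}

/* Receiver side: strip the <NUL> that follows each <CR>, in place.
 * Returns the new length. */
size_t strip_cr_nul(char *buf, size_t len)
{
    size_t i, n = 0;

    assert(buf != NULL);
    for (i = 0; i < len; i++) {
        buf[n++] = buf[i];
        if (buf[i] == '\r' && i + 1 < len && buf[i + 1] == '\0')
            i++;            /* skip the padding <NUL> */
    }
    return n;
}
```

Applied to the 12 byte pathname foo<CR><LF>boo.bar, pad_cr yields the 13 byte form foo<CR><NUL><LF>boo.bar, and strip_cr_nul restores the original bytes.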
+
+3.2 Servers compliance
+
+ - Servers MUST support the UTF-8 feature in response to the FEAT
+ command [RFC2389]. The UTF-8 feature is a line containing the exact
+ string "UTF8". This string is not case sensitive, but SHOULD be
+ transmitted in upper case. The response to a FEAT command SHOULD
+ be:
+
+ C> feat
+ S> 211- <any descriptive text>
+ S> ...
+ S> UTF8
+ S> ...
+ S> 211 end
+
+ The ellipses indicate placeholders where other features may be
+ included, but are NOT REQUIRED. The one space indentation of the
+ feature lines is mandatory [RFC2389].
+
+
+
+
+
+Curtin Proposed Standard [Page 6]
+
+RFC 2640 FTP Internalization July 1999
+
+
+ - Mirror servers may want to exactly reflect the site that they are
+ mirroring. In such cases servers MAY store and present the exact
+ pathname bytes that they received from the main server.
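
A client can recognize the UTF8 feature line described in this section with a small check on each line of the FEAT response. The following C sketch assumes the caller has already split the response into lines; the function name is_utf8_feature is hypothetical. It enforces the single leading space and compares the feature name without regard to case.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return 1 if line is the UTF8 feature line of a FEAT response:
 * exactly one leading space, then "UTF8" in any case, then end of line. */
int is_utf8_feature(const char *line)
{
    static const char name[] = "UTF8";
    size_t i;

    assert(line != NULL);
    if (line[0] != ' ')
        return 0;
    for (i = 0; i < 4; i++) {
        /* A short line stops here: its terminating NUL never matches. */
        if (toupper((unsigned char)line[1 + i]) != name[i])
            return 0;
    }
    /* Accept end of string or a trailing CRLF / LF. */
    return line[5] == '\0' || line[5] == '\r' || line[5] == '\n';
}
```

The status lines ("211- ..." and "211 end") do not begin with a space, so they are rejected by the first test.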
+
+3.3 Clients compliance
+
+ - Clients which do not require display of pathnames are under no
+ obligation to do so. Non-display clients do not need to conform to
+ requirements associated with display.
+
+ - Clients, which are presented UTF-8 pathnames by the server, SHOULD
+ parse UTF-8 correctly and attempt to display the pathname within
+ the limitation of the resources available.
+
+ - Clients MUST support the FEAT command and recognize the "UTF8"
+ feature (defined in 3.2 above) to determine if a server supports
+ UTF-8 encoding.
+
+ - Character semantics of other names shall remain undefined. If a
+ client detects that a server is non UTF-8, it SHOULD change its
+ display appropriately. How a client implementation handles non
+ UTF-8 is a quality of implementation issue. It MAY try to assume
+ some other encoding, give the user a chance to try to assume
+ something, or save encoding assumptions for a server from one FTP
+ session to another.
+
+ - Glyph rendering is outside the scope of this document. How a client
+ presents characters it cannot display is a quality of
+ implementation issue. This document RECOMMENDS that octets
+ corresponding to non-displayable characters SHOULD be presented in
+ URL %HH format defined in RFC 1738 [RFC1738]. They MAY, however,
+ display them as question marks, with their UCS hexadecimal value,
+ or in any other suitable fashion.
+
+ - Many existing clients interpret 8-bit pathnames as being in the
+ local character set. They MAY continue to do so for pathnames that
+ are not valid UTF-8.
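
The %HH presentation recommended in the list above might look like the following C sketch. It makes two assumptions that are not part of the RFC: the client can display only printable ASCII, so every other octet is treated as non-displayable, and a literal '%' is also escaped so that the output stays unambiguous. The name render_pathname is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Render len pathname bytes into out as a NUL-terminated string, keeping
 * printable ASCII (0x20..0x7E) except '%' and writing every other octet
 * as %HH. out must hold at least 3 * len + 1 bytes. Returns the length
 * of the rendered string. */
size_t render_pathname(const unsigned char *path, size_t len, char *out)
{
    size_t i, n = 0;

    assert(path != NULL && out != NULL);
    for (i = 0; i < len; i++) {
        unsigned char c = path[i];
        if (c >= 0x20 && c <= 0x7E && c != '%')
            out[n++] = (char)c;
        else
            n += (size_t)sprintf(out + n, "%%%02X", (unsigned)c);
    }
    out[n] = '\0';
    return n;
}
```

A client with richer display capabilities would first decode valid UTF-8 sequences and fall back to %HH only for the characters it cannot render.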
+
+4. Language Support
+
+ The Character Set Workshop Report [RFC2130] suggests that clients and
+ servers SHOULD negotiate a language for "greetings" and "error
+ messages". This specification interprets the use of the term "error
+ message", by RFC 2130, to mean any explanatory text string returned
+ by server-PI in response to a user-PI command.
+
+
+
+
+
+
+Curtin Proposed Standard [Page 7]
+
+RFC 2640 FTP Internalization July 1999
+
+
+ Implementers SHOULD note that FTP commands and numeric responses are
+ protocol elements. As such, their use is not affected by any guidance
+ expressed by this specification.
+
+ Language support of greetings and command responses shall be the
+ default language supported by the server or the language supported by
+ the server and selected by the client.
+
+ It may be possible to achieve language support through a virtual host
+ as described in [MLST]. However, an FTP server might not support
+ virtual servers, or virtual servers might be configured to support an
+ environment without regard for language. To allow language
+ negotiation this specification defines a new LANG command. Clients
+ and servers that comply with this specification MUST support the LANG
+ command.
+
+4.1 The LANG command
+
+ A new command "LANG" is added to the FTP command set to allow
+ server-FTP process to determine in which language to present server
+ greetings and the textual part of command responses. The parameter
+ associated with the LANG command SHALL be one of the language tags
+ defined in RFC 1766 [RFC1766]. If a LANG command without a parameter
+ is issued the server's default language will be used.
+
+ Greetings and responses issued prior to language negotiation SHALL be
+ in the server's default language. Paragraph 4.5 of [RFC2277] states
+ that this "default language MUST be understandable by an English-
+ speaking person". This specification RECOMMENDS that the server
+ default language be English encoded using ASCII. This text may be
+ augmented by text from other languages. Once negotiated, server-PI
+ MUST return server messages and textual part of command responses in
+ the negotiated language and encoded in UTF-8. Server-PI MAY wish to
+ re-send previously issued server messages in the newly negotiated
+ language.
+
+ The LANG command only affects presentation of greeting messages and
+ explanatory text associated with command responses. No attempt should
+ be made by the server to translate protocol elements (FTP commands
+ and numeric responses) or data transmitted over the data connection.
+
+ User-PI MAY issue the LANG command at any time during an FTP session.
+ In order to gain the full benefit of this command, it SHOULD be
+ presented prior to authentication. In general, it will be issued
+ after the HOST command [MLST]. Note that the issuance of a HOST or
+
+
+
+
+
+
+Curtin Proposed Standard [Page 8]
+
+RFC 2640 FTP Internalization July 1999
+
+
+ REIN command [RFC959] will negate the effect of the LANG command.
+ User-PI SHOULD be capable of supporting UTF-8 encoding for the
+ language negotiated. Guidance on interpretation and rendering of
+ UTF-8, defined in section 3, SHALL apply.
+
+ Although NOT REQUIRED by this specification, a user-PI SHOULD issue a
+ FEAT command [RFC2389] prior to a LANG command. This will allow the
+ user-PI to determine if the server supports the LANG command and
+ which language options it offers.
+
+ In order to aid the server in identifying whether a connection has
+ been established with a client which conforms to this specification
+ or an older client, user-PI MUST send a HOST [MLST] and/or LANG
+ command prior to issuing any other command (other than FEAT
+ [RFC2389]). If user-PI issues a HOST command, and the server's
+ default language is acceptable, it need not issue a LANG command.
+ However, if the implementation does not support the HOST command, a
+ LANG command MUST be issued. Until server-PI is presented with either
+ a HOST or LANG command it SHOULD assume that the user-PI does not
+ comply with this specification.
+
+4.2 Syntax of the LANG command
+
+ The LANG command is defined as follows:
+
+ lang-command = "Lang" [(SP lang-tag)] CRLF
+ lang-tag = Primary-tag *( "-" Sub-tag)
+ Primary-tag = 1*8ALPHA
+ Sub-tag = 1*8ALPHA
+
+ lang-response = lang-ok / error-response
+ lang-ok = "200" [SP *(%x00..%xFF) ] CRLF
+ error-response = command-unrecognized / bad-argument /
+ not-implemented / unsupported-parameter
+ command-unrecognized = "500" [SP *(%x01..%xFF) ] CRLF
+ bad-argument = "501" [SP *(%x01..%xFF) ] CRLF
+ not-implemented = "502" [SP *(%x01..%xFF) ] CRLF
+ unsupported-parameter = "504" [SP *(%x01..%xFF) ] CRLF
+
+ The "lang" command word is case independent and may be specified in
+ any character case desired. Therefore "LANG", "lang", "Lang", and
+ "lAnG" are equivalent commands.
+
+ The OPTIONAL "Lang-tag" given as a parameter specifies the primary
+ language tags and zero or more sub-tags as defined in [RFC1766]. As
+ described in [RFC1766] language tags are treated as case insensitive.
+ If omitted server-PI MUST use the server's default language.
+
+
+
+
+Curtin Proposed Standard [Page 9]
+
+RFC 2640 FTP Internalization July 1999
+
+
+ Server-FTP responds to the "Lang" command with either "lang-ok" or
+ "error-response". "lang-ok" MUST be sent if Server-FTP supports the
+ "Lang" command and can support some form of the "lang-tag". Support
+ SHOULD be as follows:
+
+ - If server-FTP receives "Lang" with no parameters it SHOULD return
+ messages and command responses in the server default language.
+
+ - If server-FTP receives "Lang" with only a primary tag argument
+ (e.g. en, fr, de, ja, zh, etc.), which it can support, it SHOULD
+ return messages and command responses in the language associated
+ with that primary tag. It is possible that server-FTP will only
+ support the primary tag when combined with a sub-tag (e.g. en-US,
+ en-UK, etc.). In such cases, server-FTP MAY determine the
+ appropriate variant to use during the session. How server-FTP makes
+ that determination is outside the scope of this specification. If
+ server-FTP cannot determine if a sub-tag variant is appropriate it
+ SHOULD return an "unsupported-parameter" (504) response.
+
+ - If server-FTP receives "Lang" with a primary tag and sub-tag(s)
+ argument, which is implemented, it SHOULD return messages and
+ command responses in support of the language argument. It is
+ possible that server-FTP can support the primary tag of the "Lang"
+ argument but not the sub-tag(s). In such cases server-FTP MAY
+ return messages and command responses in the most appropriate
+ variant of the primary tag that has been implemented. How server-
+ FTP makes that determination is outside the scope of this
+ specification. If server-FTP cannot determine if a sub-tag variant
+ is appropriate it SHOULD return an "unsupported-parameter" (504)
+ response.
+
+ For example if client-FTP sends a "LANG en-AU" command and server-FTP
+ has implemented language tags en-US and en-UK it may decide that the
+ most appropriate language tag is en-UK and return "200 en-AU not
+ supported. Language set to en-UK". The numeric response is a protocol
+ element and can not be changed. The associated string is for
+ illustrative purposes only.
+
+ Clients and servers that conform to this specification MUST support
+ the LANG command. Clients SHOULD, however, anticipate receiving a 500
+ or 502 command response, in cases where older or non-compliant
+ servers do not recognize or have not implemented the "Lang". A 501
+ response SHOULD be sent if the argument to the "Lang" command is not
+ syntactically correct. A 504 response SHOULD be sent if the "Lang"
+ argument, while syntactically correct, is not implemented. As noted
+ above, an argument may be considered a lexicon match even though it
+ is not an exact syntax match.
+
+
+
+
+Curtin Proposed Standard [Page 10]
+
+RFC 2640 FTP Internalization July 1999
+
+
+4.3 Feat response for LANG command
+
+ A server-FTP process that supports the LANG command, and language
+ support for messages and command responses, MUST include in the
+ response to the FEAT command [RFC2389], a feature line indicating
+ that the LANG command is supported and a fact list of the supported
+ language tags. A response to a FEAT command SHALL be in the following
+ format:
+
+ Lang-feat = SP "LANG" SP lang-fact CRLF
+ lang-fact = lang-tag ["*"] *(";" lang-tag ["*"])
+
+ lang-tag = Primary-tag *( "-" Sub-tag)
+ Primary-tag= 1*8ALPHA
+ Sub-tag = 1*8ALPHA
+
+ The lang-feat response contains the string "LANG" followed by a
+ language fact. This string is not case sensitive, but SHOULD be
+ transmitted in upper case, as recommended in [RFC2389]. The initial
+ space shown in the Lang-feat response is REQUIRED by the FEAT
+ command. It MUST be a single space character. More or fewer space
+ characters are not permitted. The lang-fact SHALL include the lang-
+ tags which server-FTP can support. At least one lang-tag MUST be
+ included with the FEAT response. The lang-tag SHALL be in the form
+ described earlier in this document. The OPTIONAL asterisk, when
+ present, SHALL indicate the current lang-tag being used by server-FTP
+ for messages and responses.
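
A client that has received a LANG feature line can find the language currently in use by scanning the lang-fact for the asterisk. The following C sketch assumes the caller has already removed the leading " LANG " and the trailing <CRLF>; the function name current_lang is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Find the current language (the tag marked with '*') in a lang-fact
 * such as "EN;FR*". Copies the tag, without the asterisk, into out
 * (of size outsz) and returns 1. Returns 0 if no tag is marked or the
 * tag does not fit in out. */
int current_lang(const char *fact, char *out, size_t outsz)
{
    assert(fact != NULL && out != NULL);
    while (*fact != '\0') {
        size_t len = strcspn(fact, ";");     /* one "tag" or "tag*" item */
        if (len > 1 && fact[len - 1] == '*') {
            if (len - 1 >= outsz)
                return 0;
            memcpy(out, fact, len - 1);
            out[len - 1] = '\0';
            return 1;
        }
        fact += len;
        if (*fact == ';')
            fact++;                          /* step over the separator */
    }
    return 0;
}
```

Because the asterisk is OPTIONAL, a return value of 0 does not mean the server lacks a current language, only that the response did not mark one.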
+
+4.3.1 Feat examples
+
+ C> feat
+ S> 211- <any descriptive text>
+ S> ...
+ S> LANG EN*
+ S> ...
+ S> 211 end
+
+ In this example server-FTP can only support English, which is the
+ current language (as shown by the asterisk) being used by the server
+ for messages and command responses.
+
+ C> feat
+ S> 211- <any descriptive text>
+ S> ...
+ S> LANG EN*;FR
+ S> ...
+ S> 211 end
+
+
+
+
+Curtin Proposed Standard [Page 11]
+
+RFC 2640 FTP Internalization July 1999
+
+
+ C> LANG fr
+ S> 200 Le response sera changez au francais
+
+ C> feat
+ S> 211- <quelconque descriptif texte>
+ S> ...
+ S> LANG EN;FR*
+ S> ...
+ S> 211 end
+
+ In this example server-FTP supports both English and French as shown
+ by the initial response to the FEAT command. The asterisk indicates
+ that English is the current language in use by server-FTP. After a
+ LANG command is issued to change the language to French, the FEAT
+ response shows French as the current language in use.
+
+ In the above examples ellipses indicate placeholders where other
+ features may be included, but are NOT REQUIRED.
+
+5 Security Considerations
+
+ This document addresses the support of character sets beyond 1 byte
+ and a new language negotiation command. Conformance to this document
+ should not induce a security risk.
+
+6 Acknowledgments
+
+ The following people have contributed to this document:
+
+ D. J. Bernstein
+ Martin J. Duerst
+ Mark Harris
+ Paul Hethmon
+ Alun Jones
+ Gregory Lundberg
+ James Matthews
+ Keith Moore
+ Sandra O'Donnell
+ Benjamin Riefenstahl
+ Stephen Tihor
+
+ (and others from the FTPEXT working group)
+
+
+
+
+
+
+
+
+
+Curtin Proposed Standard [Page 12]
+
+RFC 2640 FTP Internalization July 1999
+
+
+7 Glossary
+
+ BIDI - abbreviation for Bi-directional, a reference to mixed right-
+ to-left and left-to-right text.
+
+ Character Set - a collection of characters used to represent textual
+ information in which each character has a numeric value
+
+ Code Set - (see character set).
+
+ Glyph - a character image represented on a display device.
+
+ I18N - "I eighteen N", the first and last letters of the word
+ "internationalization" and the eighteen letters in between.
+
+ UCS-2 - the ISO/IEC 10646 two octet Universal Character Set form.
+
+ UCS-4 - the ISO/IEC 10646 four octet Universal Character Set form.
+
+ UTF-8 - the UCS Transformation Format represented in 8 bits.
+
+ UTF-16 - A 16-bit format including the BMP (directly encoded) and
+ surrogate pairs to represent characters in planes 01-16; equivalent
+ to Unicode.
+
+8 Bibliography
+
+ [ABNF] Crocker, D. and P. Overell, "Augmented BNF for Syntax
+ Specifications: ABNF", RFC 2234, November 1997.
+
+ [ASCII] ANSI X3.4:1986 Coded Character Sets - 7 Bit American
+ National Standard Code for Information Interchange (7-
+ bit ASCII)
+
+ [ISO-8859] ISO 8859. International standard -- Information
+ processing -- 8-bit single-byte coded graphic character
+ sets -- Part 1:Latin alphabet No. 1 (1987) -- Part 2:
+ Latin alphabet No. 2 (1987) -- Part 3: Latin alphabet
+ No. 3 (1988) -- Part 4: Latin alphabet No. 4 (1988) --
+ Part 5: Latin/Cyrillic alphabet (1988) -- Part 6:
+ Latin/Arabic alphabet (1987) -- Part 7: Latin/Greek
+ alphabet (1987) -- Part 8: Latin/Hebrew alphabet (1988)
+ -- Part 9: Latin alphabet No. 5 (1989) -- Part 10: Latin
+ alphabet No. 6 (1992)
+
+ [BCP14] Bradner, S., "Key words for use in RFCs to Indicate
+ Requirement Levels", BCP 14, RFC 2119, March 1997.
+
+
+
+
+Curtin Proposed Standard [Page 13]
+
+RFC 2640 FTP Internalization July 1999
+
+
+ [ISO-10646] ISO/IEC 10646-1:1993. International standard --
+ Information technology -- Universal multiple-octet coded
+ character set (UCS) -- Part 1: Architecture and basic
+ multilingual plane.
+
+ [MLST] Elz, R. and P. Hethmon, "Extensions to FTP", Work in
+ Progress.
+
+ [RFC854] Postel, J. and J. Reynolds, "Telnet Protocol
+ Specification", STD 8, RFC 854, May 1983.
+
+ [RFC959] Postel, J. and J. Reynolds, "File Transfer Protocol
+ (FTP)", STD 9, RFC 959, October 1985.
+
+ [RFC1123] Braden, R., "Requirements for Internet Hosts --
+ Application and Support", STD 3, RFC 1123, October 1989.
+
+ [RFC1738] Berners-Lee, T., Masinter, L. and M. McCahill, "Uniform
+ Resource Locators (URL)", RFC 1738, December 1994.
+
+ [RFC1766] Alvestrand, H., "Tags for the Identification of
+ Languages", RFC 1766, March 1995.
+
+ [RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
+ Atkinson, R., Crispin, M. and P. Svanberg, "Character
+ Set Workshop Report", RFC 2130, April 1997.
+
+ [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and
+ Languages", RFC 2277, January 1998.
+
+ [RFC2279] Yergeau, F., "UTF-8, a transformation format of ISO
+ 10646", RFC 2279, January 1998.
+
+ [RFC2389] Elz, R. and P. Hethmon, "Feature Negotiation Mechanism
+ for the File Transfer Protocol", RFC 2389, August 1998.
+
+ [UNICODE] The Unicode Consortium, "The Unicode Standard - Version
+ 2.0", Addison-Wesley Developers Press, July 1996.
+
+ [UTF-8] ISO/IEC 10646-1:1993 AMENDMENT 2 (1996). UCS
+ Transformation Format 8 (UTF-8).
+
+
+
+
+
+
+
+
+
+
+Curtin Proposed Standard [Page 14]
+
+RFC 2640 FTP Internalization July 1999
+
+
+9 Author's Address
+
+ Bill Curtin
+ JIEO
+ Attn: JEBBD
+ Ft. Monmouth, N.J. 07703-5613
+
+ EMail: curtinw@ftm.disa.mil
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Curtin Proposed Standard [Page 15]
+
+RFC 2640 FTP Internalization July 1999
+
+
+Annex A - Implementation Considerations
+
+A.1 General Considerations
+
+ - Implementers should ensure that their code accounts for potential
+ problems, such as using a NULL character to terminate a string or
+ no longer being able to steal the high order bit for internal use,
+ when supporting the extended character set.
+
+ - Implementers should be aware that there is a chance that pathnames
+ that are non UTF-8 may be parsed as valid UTF-8. The probability
+ is low for some encodings and statistically zero for others.
+ A recent non-scientific analysis found that EUC encoded Japanese
+ words had a 2.7% false reading; SJIS had a 0.0005% false reading;
+ other encodings such as ASCII or KOI-8 have a 0% false reading. This
+ probability is highest for short pathnames and decreases as
+ pathname size increases. Implementers may want to look for signs
+ that pathnames which parse as UTF-8 are not valid UTF-8, such as
+ the existence of multiple local character sets in short pathnames.
+ Hopefully, as more implementations conform to UTF-8 transfer
+ encoding there will be a smaller need to guess at the encoding.
+
+ - Client developers should be aware that it will be possible for
+ pathnames to contain mixed characters (e.g.
+ //Latin1DirectoryName/HebrewFileName). They should be prepared to
+ handle the bi-directional (BIDI) display of these character sets
+ (i.e. left to right display for the directory and right to left
+ display for the filename). While bi-directional display is outside
+ the scope of this document and more complicated than the above
+ example, an algorithm for bi-directional display can be found in
+ the UNICODE 2.0 [UNICODE] standard. Also note that pathnames can
+ have different byte ordering yet be logically and display-wise
+ equivalent due to the insertion of BIDI control characters at
+ different points during composition. Mixed character sets may also
+ present problems with font swapping.
+
+ - A server that copies pathnames transparently from a local
+ filesystem may continue to do so. It is then up to the local file
+ creators to use UTF-8 pathnames.
+
+ - Servers can support charset labeling of files and/or directories,
+ such that different pathnames may have different charsets. The
+ server should attempt to convert all pathnames to UTF-8, but if it
+ cannot, it should leave that name in its raw form.
+
+ - Some servers' operating systems do not mandate a character set,
+ but allow administrators to configure one in the FTP server. These
+ servers should be configured to use a particular mapping table
+ (either external or built-in). This allows the flexibility of
+ defining different charsets for different directories.
+
+ - If the server's OS does not mandate the character set and the FTP
+ server cannot be configured, the server should simply use the raw
+ bytes in the file name. They might be ASCII or UTF-8.
+
+ - If the server is a mirror, and wants to look just like the site it
+ is mirroring, it should store the exact file name bytes that it
+ received from the main server.
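+
+ As an illustration of the first point above, the following sketch
+ scans a pathname for bytes with the high order bit set without
+ depending on whether plain char is signed. The function name
+ path_has_high_bit is hypothetical and is not defined by this
+ specification.
+

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if any byte of the NUL-terminated
   pathname has its high order bit set (i.e. is not 7 bit ASCII),
   0 otherwise. */
int path_has_high_bit(const char *path)
{
    const unsigned char *p = (const unsigned char *) path;

    /* Plain char may be signed, in which case bytes >= 0x80 would
       compare as negative.  Reading through unsigned char avoids
       that and leaves all 8 bits available to the pathname. */
    for (; *p != '\0'; p++)
        if (*p & 0x80)
            return 1;
    return 0;
}
```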
+
+
+A.2 Transition Considerations
+
+ - Servers which support this specification, when presented a pathname
+ from an old client (one which does not support this specification),
+ can nearly always tell whether the pathname is in UTF-8 (see B.1)
+ or in some other code set. In order to support these older clients,
+ servers may wish to default to a non UTF-8 code set. However, how a
+ server supports non UTF-8 is outside the scope of this
+ specification.
+
+ - Clients which support this specification will be able to determine
+ if the server can support UTF-8 (i.e. supports this specification)
+ by the ability of the server to support the FEAT command and the
+ UTF8 feature (defined in 3.2). If a newer client determines that
+ the server does not support UTF-8, it may wish to default to a
+ different code set. Client developers should take into
+ consideration that pathnames, associated with older servers, might
+ be stored in UTF-8. However, how a client supports non UTF-8 is
+ outside the scope of this specification.
+
+ - Clients and servers can transition to UTF-8 by either converting
+ to/from the local encoding, or the users can store UTF-8 filenames.
+ The former approach is easier on tightly controlled file systems
+ (e.g. PCs and MACs). The latter approach is easier on more free
+ form file systems (e.g. Unix).
+
+ - For interactive use attention should be focused on user interface
+ and ease of use. Non-interactive use requires a consistent and
+ controlled behavior.
+
+ - There may be many applications which reference files under their
+ old raw pathname (e.g. linked URLs). Changing the pathname to UTF-8
+ will cause access to the old URL to fail. A solution may be for the
+ server to act as if there were two different pathnames associated
+ with the file. This might be done internally by the server on
+ controlled file systems or by using symbolic links on free form
+ systems. While this approach may work for single file transfer
+ non-interactive use, a non-interactive transfer of all of the files
+ in a directory will produce duplicates. Interactive users may be
+ presented with lists of files which are double the actual number of
+ files.
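+
+ The client-side check described in the second point above might be
+ sketched as follows. It assumes the client has already collected
+ the complete multi-line FEAT reply [RFC2389] into one string. The
+ function name feat_has_utf8 is hypothetical, and the exact
+ upper-case comparison is a simplification made for this sketch.
+

```c
#include <string.h>

/* Hypothetical helper: returns 1 if a complete FEAT reply lists the
   UTF8 feature, 0 otherwise.  Feature lines begin with a single
   space; the feature name ends at a space, CR, LF or end of string.
   Simplification: only the upper-case spelling "UTF8" is matched. */
int feat_has_utf8(const char *reply)
{
    const char *line = reply;

    while (line != NULL && *line != '\0')
    {
        if (line[0] == ' ' && strncmp(line + 1, "UTF8", 4) == 0)
        {
            char end = line[5];   /* character after the name */

            if (end == '\0' || end == ' ' || end == '\r' || end == '\n')
                return 1;
        }
        line = strchr(line, '\n');   /* move to the next line */
        if (line != NULL)
            line++;
    }
    return 0;
}
```

+ A reply without the UTF8 line, or an error reply to FEAT itself,
+ yields 0, in which case the client falls back as described above.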
+
+
+Annex B - Sample Code and Examples
+
+B.1 Valid UTF-8 check
+
+ The following routine checks if a byte sequence is valid UTF-8. This
+ is done by checking for the proper tagging of the first and following
+ bytes to make sure they conform to the UTF-8 format. It then checks
+ to assure that the data part of the UTF-8 sequence conforms to the
+ proper range allowed by the encoding. Note: This routine will not
+ detect characters that have not been assigned and therefore do not
+ exist.
+
+int utf8_valid(const unsigned char *buf, unsigned int len)
+{
+ const unsigned char *endbuf = buf + len;
+ unsigned char byte2mask=0x00, c;
+ int trailing = 0; // trailing (continuation) bytes to follow
+
+ while (buf != endbuf)
+ {
+ c = *buf++;
+ if (trailing)
+ if ((c&0xC0) == 0x80) // Does trailing byte follow UTF-8 format?
+ {if (byte2mask) // Need to check 2nd byte for proper range?
+ if (c&byte2mask) // Are appropriate bits set?
+ byte2mask=0x00;
+ else
+ return 0;
+ trailing--; }
+ else
+ return 0;
+ else
+ if ((c&0x80) == 0x00) continue; // valid 1 byte UTF-8
+ else if ((c&0xE0) == 0xC0) // valid 2 byte UTF-8
+ if (c&0x1E) // Is UTF-8 byte in
+ // proper range?
+ trailing =1;
+ else
+ return 0;
+ else if ((c&0xF0) == 0xE0) // valid 3 byte UTF-8
+ {if (!(c&0x0F)) // Is UTF-8 byte in
+ // proper range?
+ byte2mask=0x20; // If not set mask
+ // to check next byte
+ trailing = 2;}
+ else if ((c&0xF8) == 0xF0) // valid 4 byte UTF-8
+ {if (!(c&0x07)) // Is UTF-8 byte in
+ // proper range?
+ byte2mask=0x30; // If not set mask
+ // to check next byte
+ trailing = 3;}
+ else if ((c&0xFC) == 0xF8) // valid 5 byte UTF-8
+ {if (!(c&0x03)) // Is UTF-8 byte in
+ // proper range?
+ byte2mask=0x38; // If not set mask
+ // to check next byte
+ trailing = 4;}
+ else if ((c&0xFE) == 0xFC) // valid 6 byte UTF-8
+ {if (!(c&0x01)) // Is UTF-8 byte in
+ // proper range?
+ byte2mask=0x3C; // If not set mask
+ // to check next byte
+ trailing = 5;}
+ else return 0;
+ }
+ return trailing == 0;
+}
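+
+ The lead byte tests used by the routine above can be read in
+ isolation. The sketch below applies the same masks to report how
+ long a sequence a given first byte announces. The name utf8_seq_len
+ is hypothetical, and the range (overlong form) checks performed by
+ utf8_valid are deliberately left out.
+

```c
/* Hypothetical helper: length in bytes (1..6) of the UTF-8 sequence
   introduced by lead byte c, or 0 if c cannot start a sequence
   (a continuation byte 0x80..0xBF, or 0xFE / 0xFF). */
int utf8_seq_len(unsigned char c)
{
    if ((c & 0x80) == 0x00) return 1;   /* 0xxxxxxx */
    if ((c & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((c & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((c & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    if ((c & 0xFC) == 0xF8) return 5;   /* 111110xx */
    if ((c & 0xFE) == 0xFC) return 6;   /* 1111110x */
    return 0;
}
```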
+
+B.2 Conversions
+
+ The code examples in this section closely reflect the algorithm in
+ ISO 10646 and may not present the most efficient solution for
+ converting to / from UTF-8 encoding. If efficiency is an issue,
+ implementers should use the appropriate bitwise operators.
+
+ Additional code examples and numerous mapping tables can be found at
+ the Unicode site, HTTP://www.unicode.org or FTP://unicode.org.
+
+ Note that the conversion examples below assume that the local
+ character set supported in the operating system is something other
+ than UCS2/UTF-16. There are some operating systems that already
+ support UCS2/UTF-16 (notably Plan 9 and Windows NT). In this case no
+ conversion will be necessary from the local character set to the UCS.
+
+B.2.1 Conversion from Local Character Set to UTF-8
+
+ Conversion from the local filesystem character set to UTF-8 will
+ normally involve a two step process. First convert the local
+ character set to the UCS; then convert the UCS to UTF-8.
+
+ The first step in the process can be performed by maintaining a
+ mapping table that includes the local character set code and the
+ corresponding UCS code. For instance the ISO/IEC 8859-8 [ISO-8859]
+ code for the Hebrew letter "VAV" is 0xE5. The corresponding 4 byte
+ ISO/IEC 10646 code is 0x000005D5.
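+
+ Where a local character set lays out a block of letters in the same
+ order as the UCS, the table for that block reduces to adding an
+ offset. The sketch below does this for the Hebrew letters of
+ ISO/IEC 8859-8. The function name and the 0xFFFFFFFF "unmapped"
+ marker are assumptions made for illustration, and a real table
+ would cover every assigned position.
+

```c
/* Hypothetical mapping routine: ISO/IEC 8859-8 byte to UCS-4 code.
   Only 7 bit ASCII and the Hebrew letters are handled.  Returns
   0xFFFFFFFF for positions this sketch does not map. */
unsigned long iso8859_8_to_ucs4(unsigned char c)
{
    if (c <= 0x7F)
        return c;                       /* ASCII maps to itself     */
    if (c >= 0xE0 && c <= 0xFA)
        return 0x05D0UL + (c - 0xE0);   /* ALEF (0xE0)..TAV (0xFA)  */
    return 0xFFFFFFFFUL;                /* not covered by sketch    */
}
```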
+
+
+ The next step is to convert the UCS character code to the UTF-8
+ encoding. The following routine can be used to determine and encode
+ the correct number of bytes based on the UCS-4 character code:
+
+ unsigned int ucs4_to_utf8 (unsigned long *ucs4_buf, unsigned int
+ ucs4_len, unsigned char *utf8_buf)
+
+ {
+ const unsigned long *ucs4_endbuf = ucs4_buf + ucs4_len;
+ unsigned int utf8_len = 0; // return value for UTF8 size
+ unsigned char *t_utf8_buf = utf8_buf; // Temporary pointer
+ // to load UTF8 values
+
+ while (ucs4_buf != ucs4_endbuf)
+ {
+ if ( *ucs4_buf <= 0x7F) // ASCII chars no conversion needed
+ {
+ *t_utf8_buf++ = (unsigned char) *ucs4_buf;
+ utf8_len++;
+ ucs4_buf++;
+ }
+ else
+ if ( *ucs4_buf <= 0x07FF ) // In the 2 byte utf-8 range
+ {
+ *t_utf8_buf++= (unsigned char) (0xC0 + (*ucs4_buf/0x40));
+ *t_utf8_buf++= (unsigned char) (0x80 + (*ucs4_buf%0x40));
+ utf8_len+=2;
+ ucs4_buf++;
+ }
+ else
+ if ( *ucs4_buf <= 0xFFFF ) /* In the 3 byte utf-8 range. The
+ values 0x0000FFFE, 0x0000FFFF
+ and 0x0000D800 - 0x0000DFFF do
+ not occur in UCS-4 */
+ {
+ *t_utf8_buf++= (unsigned char) (0xE0 +
+ (*ucs4_buf/0x1000));
+ *t_utf8_buf++= (unsigned char) (0x80 +
+ ((*ucs4_buf/0x40)%0x40));
+ *t_utf8_buf++= (unsigned char) (0x80 + (*ucs4_buf%0x40));
+ utf8_len+=3;
+ ucs4_buf++;
+ }
+ else
+ if ( *ucs4_buf <= 0x1FFFFF ) //In the 4 byte utf-8 range
+ {
+ *t_utf8_buf++= (unsigned char) (0xF0 +
+ (*ucs4_buf/0x040000));
+ *t_utf8_buf++= (unsigned char) (0x80 +
+ ((*ucs4_buf/0x1000)%0x40));
+ *t_utf8_buf++= (unsigned char) (0x80 +
+ ((*ucs4_buf/0x40)%0x40));
+ *t_utf8_buf++= (unsigned char) (0x80 + (*ucs4_buf%0x40));
+ utf8_len+=4;
+ ucs4_buf++;
+
+ }
+ else
+ if ( *ucs4_buf <= 0x03FFFFFF )//In the 5 byte utf-8 range
+ {
+ *t_utf8_buf++= (unsigned char) (0xF8 +
+ (*ucs4_buf/0x01000000));
+ *t_utf8_buf++= (unsigned char) (0x80 +
+ ((*ucs4_buf/0x040000)%0x40));
+ *t_utf8_buf++= (unsigned char) (0x80 +
+ ((*ucs4_buf/0x1000)%0x40));
+ *t_utf8_buf++= (unsigned char) (0x80 +
+ ((*ucs4_buf/0x40)%0x40));
+ *t_utf8_buf++= (unsigned char) (0x80 +
+ (*ucs4_buf%0x40));
+ utf8_len+=5;
+ ucs4_buf++;
+ }
+ else
+ if ( *ucs4_buf <= 0x7FFFFFFF )//In the 6 byte utf-8 range
+ {
+ *t_utf8_buf++= (unsigned char)
+ (0xFC +(*ucs4_buf/0x40000000));
+ *t_utf8_buf++= (unsigned char) (0x80 +
+ ((*ucs4_buf/0x01000000)%0x40));
+ *t_utf8_buf++= (unsigned char) (0x80 +
+ ((*ucs4_buf/0x040000)%0x40));
+ *t_utf8_buf++= (unsigned char) (0x80 +
+ ((*ucs4_buf/0x1000)%0x40));
+ *t_utf8_buf++= (unsigned char) (0x80 +
+ ((*ucs4_buf/0x40)%0x40));
+ *t_utf8_buf++= (unsigned char) (0x80 +
+ (*ucs4_buf%0x40));
+ utf8_len+=6;
+ ucs4_buf++;
+
+ }
+ else
+ ucs4_buf++; // above 0x7FFFFFFF: no UTF-8 form, skip it
+ }
+ return (utf8_len);
+ }
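+
+ The remark in B.2 about bitwise operators can be sketched for a
+ single character as follows. This is one possible arrangement under
+ the 6 byte UTF-8 definition of [RFC2279], not a replacement for the
+ routine above. The function name ucs4_char_to_utf8 and the
+ convention of returning 0 for an unencodable value are assumptions.
+

```c
/* Hypothetical single-character encoder using shifts and masks.
   Writes the UTF-8 form of ucs4 into out (room for 6 bytes) and
   returns the number of bytes written, or 0 if ucs4 is above
   0x7FFFFFFF and so has no UTF-8 form. */
unsigned int ucs4_char_to_utf8(unsigned long ucs4, unsigned char *out)
{
    /* Lead byte tag for each sequence length (index = length). */
    static const unsigned char lead[7] =
        { 0x00, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
    unsigned int n, i;

    if      (ucs4 <= 0x7FUL)       n = 1;
    else if (ucs4 <= 0x7FFUL)      n = 2;
    else if (ucs4 <= 0xFFFFUL)     n = 3;
    else if (ucs4 <= 0x1FFFFFUL)   n = 4;
    else if (ucs4 <= 0x3FFFFFFUL)  n = 5;
    else if (ucs4 <= 0x7FFFFFFFUL) n = 6;
    else return 0;

    /* Fill continuation bytes from the end, 6 bits at a time. */
    for (i = n - 1; i > 0; i--)
    {
        out[i] = (unsigned char) (0x80 | (ucs4 & 0x3F));
        ucs4 >>= 6;
    }
    /* What remains fits in the free bits of the lead byte. */
    out[0] = (unsigned char) (lead[n] | ucs4);
    return n;
}
```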
+
+
+B.2.2 Conversion from UTF-8 to Local Character Set
+
+ When moving from UTF-8 encoding to the local character set the
+ reverse procedure is used. First the UTF-8 encoding is transformed
+ into the UCS-4 character set. The UCS-4 is then converted to the
+ local character set from a mapping table (i.e. the opposite of the
+ table used to form the UCS-4 character code).
+
+ To convert from UTF-8 to UCS-4 the free bits (those that do not
+ define UTF-8 sequence size or signify continuation bytes) in a UTF-8
+ sequence are concatenated as a bit string. The bits are then
+ distributed into a four-byte sequence starting from the least
+ significant bits. Those bits not assigned a bit in the four-byte
+ sequence are padded with ZERO bits. The following routine converts
+ the UTF-8 encoding to UCS-4 character codes:
+
+ int utf8_to_ucs4 (unsigned long *ucs4_buf, unsigned int utf8_len,
+ unsigned char *utf8_buf)
+ {
+
+ const unsigned char *utf8_endbuf = utf8_buf + utf8_len;
+ unsigned int ucs_len=0;
+
+ while (utf8_buf != utf8_endbuf)
+ {
+
+ if ((*utf8_buf & 0x80) == 0x00) /*ASCII chars no conversion
+ needed */
+ {
+ *ucs4_buf++ = (unsigned long) *utf8_buf;
+ utf8_buf++;
+ ucs_len++;
+ }
+ else
+ if ((*utf8_buf & 0xE0)== 0xC0) //In the 2 byte utf-8 range
+ {
+ *ucs4_buf++ = (unsigned long) (((*utf8_buf - 0xC0) * 0x40)
+ + ( *(utf8_buf+1) - 0x80));
+ utf8_buf += 2;
+ ucs_len++;
+ }
+ else
+ if ( (*utf8_buf & 0xF0) == 0xE0 ) /*In the 3 byte utf-8
+ range */
+ {
+ *ucs4_buf++ = (unsigned long) (((*utf8_buf - 0xE0) * 0x1000)
+ + (( *(utf8_buf+1) - 0x80) * 0x40)
+ + ( *(utf8_buf+2) - 0x80));
+ utf8_buf+=3;
+ ucs_len++;
+ }
+ else
+ if ((*utf8_buf & 0xF8) == 0xF0) /* In the 4 byte utf-8
+ range */
+ {
+ *ucs4_buf++ = (unsigned long)
+ (((*utf8_buf - 0xF0) * 0x040000)
+ + (( *(utf8_buf+1) - 0x80) * 0x1000)
+ + (( *(utf8_buf+2) - 0x80) * 0x40)
+ + ( *(utf8_buf+3) - 0x80));
+ utf8_buf+=4;
+ ucs_len++;
+ }
+ else
+ if ((*utf8_buf & 0xFC) == 0xF8) /* In the 5 byte utf-8
+ range */
+ {
+ *ucs4_buf++ = (unsigned long)
+ (((*utf8_buf - 0xF8) * 0x01000000)
+ + ((*(utf8_buf+1) - 0x80) * 0x040000)
+ + (( *(utf8_buf+2) - 0x80) * 0x1000)
+ + (( *(utf8_buf+3) - 0x80) * 0x40)
+ + ( *(utf8_buf+4) - 0x80));
+ utf8_buf+=5;
+ ucs_len++;
+ }
+ else
+ if ((*utf8_buf & 0xFE) == 0xFC) /* In the 6 byte utf-8
+ range */
+ {
+ *ucs4_buf++ = (unsigned long)
+ (((*utf8_buf - 0xFC) * 0x40000000)
+ + ((*(utf8_buf+1) - 0x80) * 0x01000000)
+ + ((*(utf8_buf+2) - 0x80) * 0x040000)
+ + (( *(utf8_buf+3) - 0x80) * 0x1000)
+ + (( *(utf8_buf+4) - 0x80) * 0x40)
+ + ( *(utf8_buf+5) - 0x80));
+ utf8_buf+=6;
+ ucs_len++;
+ }
+ else
+ utf8_buf++; // not a valid lead byte; skip it
+
+ }
+ return (ucs_len);
+ }
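+
+ The routine above relies on its input being complete, well-formed
+ UTF-8. Where that cannot be guaranteed, a decoder along the lines
+ of the sketch below checks the remaining length and the
+ continuation byte tags before consuming a sequence. The name
+ utf8_char_to_ucs4 and its return convention are assumptions, and
+ overlong forms are not rejected here (see B.1 for that check).
+

```c
/* Hypothetical bounds-checked decoder for one UTF-8 sequence.
   Reads from buf (len bytes available), stores the UCS-4 code in
   *ucs4 and returns the number of bytes consumed, or 0 if the
   sequence is truncated or malformed. */
unsigned int utf8_char_to_ucs4(const unsigned char *buf,
                               unsigned int len,
                               unsigned long *ucs4)
{
    unsigned int n, i;
    unsigned long v;

    if (len == 0) return 0;
    if      ((buf[0] & 0x80) == 0x00) { n = 1; v = buf[0]; }
    else if ((buf[0] & 0xE0) == 0xC0) { n = 2; v = buf[0] & 0x1F; }
    else if ((buf[0] & 0xF0) == 0xE0) { n = 3; v = buf[0] & 0x0F; }
    else if ((buf[0] & 0xF8) == 0xF0) { n = 4; v = buf[0] & 0x07; }
    else if ((buf[0] & 0xFC) == 0xF8) { n = 5; v = buf[0] & 0x03; }
    else if ((buf[0] & 0xFE) == 0xFC) { n = 6; v = buf[0] & 0x01; }
    else return 0;              /* stray continuation, 0xFE, 0xFF */

    if (n > len) return 0;      /* sequence runs past the buffer  */
    for (i = 1; i < n; i++)
    {
        if ((buf[i] & 0xC0) != 0x80) return 0;
        v = (v << 6) | (buf[i] & 0x3F);   /* append 6 free bits   */
    }
    *ucs4 = v;
    return n;
}
```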
+
+
+B.2.3 ISO/IEC 8859-8 Example
+
+ This example demonstrates mapping the ISO/IEC 8859-8 character set to
+ UTF-8 and back to ISO/IEC 8859-8. As noted earlier, the Hebrew letter
+ "VAV" is converted from the ISO/IEC 8859-8 character code 0xE5 to the
+ corresponding 4 byte ISO/IEC 10646 code of 0x000005D5 by a simple
+ lookup of a conversion/mapping file.
+
+ The UCS-4 character code is transformed into UTF-8 using the
+ ucs4_to_utf8 routine described earlier by:
+
+ 1. Because the UCS-4 character is between 0x80 and 0x07FF it will map
+ to a 2 byte UTF-8 sequence.
+ 2. The first byte is defined by (0xC0 + (0x000005D5 / 0x40)) = 0xD7.
+
+ 3. The second byte is defined by (0x80 + (0x000005D5 % 0x40)) = 0x95.
+
+ The UTF-8 encoding is transformed back to UCS-4 by using the
+ utf8_to_ucs4 routine described earlier by:
+
+ 1. Because the first byte of the sequence, when the '&' operator with
+ a value of 0xE0 is applied, will produce 0xC0 (0xD7 & 0xE0 = 0xC0)
+ the UTF-8 is a 2 byte sequence.
+ 2. The four byte UCS-4 character code is produced by (((0xD7 - 0xC0)
+ * 0x40) + (0x95 -0x80)) = 0x000005D5.
+
+ Finally, the UCS-4 character code is converted to ISO/IEC 8859-8
+ character code (using the mapping table which matches ISO/IEC 8859-8
+ to UCS-4) to produce the original 0xE5 code for the Hebrew letter
+ "VAV".
+
+B.2.4 Vendor Codepage Example
+
+ This example demonstrates the mapping of a codepage to UTF-8 and back
+ to a vendor codepage. Mapping between vendor codepages can be done in
+ a very similar manner as described above. For instance both the PC
+ and Mac codepages reflect the character set from the Thai standard
+ TIS 620-2533. The character code on both platforms for the Thai
+ letter "SO SO" is 0xAB. This character can then be mapped into the
+ UCS-4 by way of a conversion/mapping file to produce the UCS-4 code
+ of 0x0E0B.
+
+ The UCS-4 character code is transformed into UTF-8 using the
+ ucs4_to_utf8 routine described earlier by:
+
+ 1. Because the UCS-4 character is between 0x0800 and 0xFFFF it will
+ map to a 3 byte UTF-8 sequence.
+ 2. The first byte is defined by (0xE0 + (0x00000E0B / 0x1000)) = 0xE0.
+ 3. The second byte is defined by (0x80 + ((0x00000E0B / 0x40) %
+ 0x40)) = 0xB8.
+ 4. The third byte is defined by (0x80 + (0x00000E0B % 0x40)) = 0x8B.
+
+ The UTF-8 encoding is transformed back to UCS-4 by using the
+ utf8_to_ucs4 routine described earlier by:
+
+ 1. Because the first byte of the sequence, when the '&' operator with
+ a value of 0xF0 is applied, will produce 0xE0 (0xE0 & 0xF0 = 0xE0)
+ the UTF-8 is a 3 byte sequence.
+ 2. The four byte UCS-4 character code is produced by (((0xE0 - 0xE0)
+ * 0x1000) + ((0xB8 - 0x80) * 0x40) + (0x8B - 0x80)) = 0x00000E0B.
+
+ Finally, the UCS-4 character code is converted to either the PC or
+ MAC codepage character code (using the mapping table which matches
+ codepage to UCS-4) to produce the original 0xAB code for the Thai
+ letter "SO SO".
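+
+ The round trip above can be followed in code. The sketch below
+ assumes the TIS 620-2533 Thai positions 0xA1 - 0xDA and 0xDF - 0xFB
+ sit at a fixed offset of 0x0D60 from the UCS Thai block, and uses
+ the same divide and modulo steps as the text for the 3 byte case
+ only. All four function names are hypothetical.
+

```c
/* Hypothetical TIS 620-2533 to UCS-4 mapping; returns 0xFFFFFFFF
   for positions that are not assigned. */
unsigned long tis620_to_ucs4(unsigned char c)
{
    if (c <= 0x7F)
        return c;
    if ((c >= 0xA1 && c <= 0xDA) || (c >= 0xDF && c <= 0xFB))
        return (unsigned long) c + 0x0D60UL;
    return 0xFFFFFFFFUL;
}

/* Reverse mapping; returns -1 if the code has no TIS 620 position. */
int ucs4_to_tis620(unsigned long u)
{
    if (u <= 0x7FUL)
        return (int) u;
    if ((u >= 0x0E01UL && u <= 0x0E3AUL) ||
        (u >= 0x0E3FUL && u <= 0x0E5BUL))
        return (int) (u - 0x0D60UL);
    return -1;
}

/* Encode a UCS-4 code in the 3 byte range (0x0800 - 0xFFFF). */
void ucs4_to_utf8_3byte(unsigned long ucs4, unsigned char out[3])
{
    out[0] = (unsigned char) (0xE0 + (ucs4 / 0x1000));
    out[1] = (unsigned char) (0x80 + ((ucs4 / 0x40) % 0x40));
    out[2] = (unsigned char) (0x80 + (ucs4 % 0x40));
}

/* Decode a 3 byte UTF-8 sequence back to its UCS-4 code. */
unsigned long utf8_3byte_to_ucs4(const unsigned char in[3])
{
    return ((unsigned long) (in[0] - 0xE0) * 0x1000)
         + ((unsigned long) (in[1] - 0x80) * 0x40)
         +  (unsigned long) (in[2] - 0x80);
}
```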
+
+B.3 Pseudo Code for a High-Quality Translating Server
+
+ if utf8_valid(fn)
+ {
+ attempt to convert fn to the local charset, producing localfn
+ if (conversion fails temporarily) return error
+ if (conversion succeeds)
+ {
+ attempt to open localfn
+ if (open fails temporarily) return error
+ if (open succeeds) return success
+ }
+ }
+ attempt to open fn
+ if (open fails temporarily) return error
+ if (open succeeds) return success
+ return permanent error
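+
+ One way the pseudo code above might look in C is sketched below.
+ The server-specific steps (validity check, charset conversion and
+ file open) are passed in as function pointers. Every name in the
+ sketch is hypothetical, and the demo_ functions are stand-ins for
+ demonstration only, not real conversion or filesystem code.
+

```c
#include <stddef.h>
#include <string.h>

enum status { ST_OK, ST_TEMP_FAIL, ST_PERM_FAIL };

/* Hooks supplied by the server.  All names here are hypothetical. */
struct name_ops
{
    int (*is_utf8)(const unsigned char *buf, unsigned int len);
    enum status (*to_local)(const char *fn, char *localfn, size_t cap);
    enum status (*open_file)(const char *path);
};

enum status translating_open(const struct name_ops *ops, const char *fn)
{
    char localfn[1024];
    enum status st;

    if (ops->is_utf8((const unsigned char *) fn,
                     (unsigned int) strlen(fn)))
    {
        st = ops->to_local(fn, localfn, sizeof localfn);
        if (st == ST_TEMP_FAIL)
            return ST_TEMP_FAIL;
        if (st == ST_OK)
        {
            st = ops->open_file(localfn);
            if (st != ST_PERM_FAIL)
                return st;          /* success or temporary error */
        }
    }
    return ops->open_file(fn);      /* last resort: the raw name  */
}

/* Stand-in: pretend every name parses as UTF-8. */
static int demo_is_utf8(const unsigned char *buf, unsigned int len)
{
    (void) buf; (void) len;
    return 1;
}

/* Stand-in: a 7 bit local charset, so high bit bytes cannot convert. */
static enum status demo_to_local(const char *fn, char *localfn,
                                 size_t cap)
{
    size_t i, n = strlen(fn);

    if (n + 1 > cap) return ST_PERM_FAIL;
    for (i = 0; i < n; i++)
        if ((unsigned char) fn[i] & 0x80)
            return ST_PERM_FAIL;
    memcpy(localfn, fn, n + 1);
    return ST_OK;
}

/* Stand-in: a pretend filesystem holding exactly two names. */
static enum status demo_open(const char *path)
{
    if (strcmp(path, "readme.txt") == 0 ||
        strcmp(path, "\xD7\x95") == 0)
        return ST_OK;
    return ST_PERM_FAIL;
}
```

+ With these stand-ins, a name that cannot be converted to the local
+ charset is still found through the final raw open, which is the
+ fall-through path of the pseudo code.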
+
+
+Full Copyright Statement
+
+ Copyright (C) The Internet Society (1999). All Rights Reserved.
+
+ This document and translations of it may be copied and furnished to
+ others, and derivative works that comment on or otherwise explain it
+ or assist in its implementation may be prepared, copied, published
+ and distributed, in whole or in part, without restriction of any
+ kind, provided that the above copyright notice and this paragraph are
+ included on all such copies and derivative works. However, this
+ document itself may not be modified in any way, such as by removing
+ the copyright notice or references to the Internet Society or other
+ Internet organizations, except as needed for the purpose of
+ developing Internet standards in which case the procedures for
+ copyrights defined in the Internet Standards process must be
+ followed, or as required to translate it into languages other than
+ English.
+
+ The limited permissions granted above are perpetual and will not be
+ revoked by the Internet Society or its successors or assigns.
+
+ This document and the information contained herein is provided on an
+ "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
+ TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
+ BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
+ HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
+ MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+
+Acknowledgement
+
+ Funding for the RFC Editor function is currently provided by the
+ Internet Society.
+
+