| field | value | date |
|---|---|---|
| author | Thomas Voss <mail@thomasvoss.com> | 2024-11-27 20:54:24 +0100 |
| committer | Thomas Voss <mail@thomasvoss.com> | 2024-11-27 20:54:24 +0100 |
| commit | 4bfd864f10b68b71482b35c818559068ef8d5797 | |
| tree | e3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc2640.txt | |
| parent | ea76e11061bda059ae9f9ad130a9895cc85607db | |
doc: Add RFC documents
Diffstat (limited to 'doc/rfc/rfc2640.txt')
| -rw-r--r-- | doc/rfc/rfc2640.txt | 1515 | 
1 files changed, 1515 insertions, 0 deletions
diff --git a/doc/rfc/rfc2640.txt b/doc/rfc/rfc2640.txt new file mode 100644 index 0000000..73ff879 --- /dev/null +++ b/doc/rfc/rfc2640.txt @@ -0,0 +1,1515 @@ + + + + + + +Network Working Group                                          B. Curtin +Request for Comments: 2640            Defense Information Systems Agency +Updates: 959                                                   July 1999 +Category: Proposed Standard + + +           Internationalization of the File Transfer Protocol + +Status of this Memo + +   This document specifies an Internet standards track protocol for the +   Internet community, and requests discussion and suggestions for +   improvements.  Please refer to the current edition of the "Internet +   Official Protocol Standards" (STD 1) for the standardization state +   and status of this protocol.  Distribution of this memo is unlimited. + +Copyright Notice + +   Copyright (C) The Internet Society (1999).  All Rights Reserved. + +Abstract + +   The File Transfer Protocol, as defined in RFC 959 [RFC959] and RFC +   1123 Section 4 [RFC1123], is one of the oldest and widely used +   protocols on the Internet. The protocol's primary character set, 7 +   bit ASCII, has served the protocol well through the early growth +   years of the Internet. However, as the Internet becomes more global, +   there is a need to support character sets beyond 7 bit ASCII. + +   This document addresses the internationalization (I18n) of FTP, which +   includes supporting the multiple character sets and languages found +   throughout the Internet community.  This is achieved by extending the +   FTP specification and giving recommendations for proper +   internationalization support. 
+ +Table of Contents + +   ABSTRACT.......................................................1 +   1 INTRODUCTION.................................................2 +    1.1 Requirements Terminology..................................2 +   2 INTERNATIONALIZATION.........................................3 +    2.1 International Character Set...............................3 +    2.2 Transfer Encoding Set.....................................4 +   3 PATHNAMES....................................................5 +    3.1 General compliance........................................5 +    3.2 Servers compliance........................................6 +    3.3 Clients compliance........................................7 +   4 LANGUAGE SUPPORT.............................................7 + + + +Curtin                     Proposed Standard                    [Page 1] + +RFC 2640                  FTP Internalization                  July 1999 + + +    4.1 The LANG command..........................................8 +    4.2 Syntax of the LANG command................................9 +    4.3 Feat response for LANG command...........................11 +     4.3.1 Feat examples.........................................11 +   5 SECURITY CONSIDERATIONS.....................................12 +   6 ACKNOWLEDGMENTS.............................................12 +   7 GLOSSARY....................................................13 +   8 BIBLIOGRAPHY................................................13 +   9 AUTHOR'S ADDRESS............................................15 +   ANNEX A - IMPLEMENTATION CONSIDERATIONS.......................16 +    A.1 General Considerations...................................16 +    A.2 Transition Considerations................................18 +   ANNEX B - SAMPLE CODE AND EXAMPLES............................19 +    B.1 Valid UTF-8 check........................................19 +    B.2 Conversions..............................................20 +     B.2.1 Conversion 
from Local Character Set to UTF-8..........20 +     B.2.2 Conversion from UTF-8 to Local Character Set..........23 +     B.2.3 ISO/IEC 8859-8 Example................................25 +     B.2.4 Vendor Codepage Example...............................25 +    B.3 Pseudo Code for Translating Servers......................26 +   Full Copyright Statement......................................27 + +1 Introduction + +   As the Internet grows throughout the world the requirement to support +   character sets outside of the ASCII [ASCII] / Latin-1 [ISO-8859] +   character set becomes ever more urgent.  For FTP, because of the +   large installed base, it is paramount that this is done without +   breaking existing clients and servers. This document addresses this +   need. In doing so it defines a solution which will still allow the +   installed base to interoperate with new clients and servers. + +   This document enhances the capabilities of the File Transfer Protocol +   by removing the 7-bit restrictions on pathnames used in client +   commands and server responses, RECOMMENDs the use of a Universal +   Character Set (UCS) ISO/IEC 10646 [ISO-10646], RECOMMENDs a UCS +   transformation format (UTF) UTF-8 [UTF-8], and defines a new command +   for language negotiation. + +   The recommendations made in this document are consistent with the +   recommendations expressed by the IETF policy related to character +   sets and languages as defined in RFC 2277 [RFC2277]. + +1.1.  Requirements Terminology + +   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", +   "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this +   document are to be interpreted as described in BCP 14 [BCP14]. 
+ + + +Curtin                     Proposed Standard                    [Page 2] + +RFC 2640                  FTP Internalization                  July 1999 + + +2 Internationalization + +   The File Transfer Protocol was developed when the predominate +   character sets were 7 bit ASCII and 8 bit EBCDIC. Today these +   character sets cannot support the wide range of characters needed by +   multinational systems. Given that there are a number of character +   sets in current use that provide more characters than 7-bit ASCII, it +   makes sense to decide on a convenient way to represent the union of +   those possibilities. To work globally either requires support of a +   number of character sets and to be able to convert between them, or +   the use of a single preferred character set. To assure global +   interoperability this document RECOMMENDS the latter approach and +   defines a single character set, in addition to NVT ASCII and EBCDIC, +   which is understandable by all systems. For FTP this character set +   SHALL be ISO/IEC 10646:1993.  For support of global compatibility it +   is STRONGLY RECOMMENDED that clients and servers use UTF-8 encoding +   when exchanging pathnames.  Clients and servers are, however, under +   no obligation to perform any conversion on the contents of a file for +   operations such as STOR or RETR. + +   The character set used to store files SHALL remain a local decision +   and MAY depend on the capability of local operating systems. Prior to +   the exchange of pathnames they SHOULD be converted into a ISO/IEC +   10646 format and UTF-8 encoded. This approach, while allowing +   international exchange of pathnames, will still allow backward +   compatibility with older systems because the code set positions for +   ASCII characters are identical to the one byte sequence in UTF-8. + +   Sections 2.1 and 2.2 give a brief description of the international +   character set and transfer encoding RECOMMENDED by this document. 
A +   more thorough description of UTF-8, ISO/IEC 10646, and UNICODE +   [UNICODE], beyond that given in this document, can be found in RFC +   2279 [RFC2279]. + +2.1 International Character Set + +   The character set defined for international support of FTP SHALL be +   the Universal Character Set as defined in ISO 10646:1993 as amended. +   This standard incorporates the character sets of many existing +   international, national, and corporate standards. ISO/IEC 10646 +   defines two alternate forms of encoding, UCS-4 and UCS-2. UCS-4 is a +   four byte (31 bit) encoding containing 2**31 code positions divided +   into 128 groups of 256 planes. Each plane consists of 256 rows of 256 +   cells. UCS-2 is a 2 byte (16 bit) character set consisting of plane +   zero or the Basic Multilingual Plane (BMP).  Currently, no codesets +   have been defined outside of the 2 byte BMP. + + + + + +Curtin                     Proposed Standard                    [Page 3] + +RFC 2640                  FTP Internalization                  July 1999 + + +   The Unicode standard version 2.0 [UNICODE] is consistent with the +   UCS-2 subset of ISO/IEC 10646. The Unicode standard version 2.0 +   includes the repertoire of IS 10646 characters, amendments 1-7 of IS +   10646, and editorial and technical corrigenda. + +2.2 Transfer Encoding + +   UCS Transformation Format 8 (UTF-8), in the past referred to as UTF-2 +   or UTF-FSS, SHALL be used as a transfer encoding to transmit the +   international character set. UTF-8 is a file safe encoding which +   avoids the use of byte values that have special significance during +   the parsing of pathname character strings. UTF-8 is an 8 bit encoding +   of the characters in the UCS. 
Some of UTF-8's benefits are that it is +   compatible with 7 bit ASCII, so it doesn't affect programs that give +   special meanings to various ASCII characters; it is immune to +   synchronization errors; its encoding rules allow for easy +   identification; and it has enough space to support a large number of +   character sets. + +   UTF-8 encoding represents each UCS character as a sequence of 1 to 6 +   bytes in length. For all sequences of one byte the most significant +   bit is ZERO. For all sequences of more than one byte the number of +   ONE bits in the first byte, starting from the most significant bit +   position, indicates the number of bytes in the UTF-8 sequence +   followed by a ZERO bit. For example, the first byte of a 3 byte UTF-8 +   sequence would have 1110 as its most significant bits. Each +   additional bytes (continuing bytes) in the UTF-8 sequence, contain a +   ONE bit followed by a ZERO bit as their most significant bits. The +   remaining free bit positions in the continuing bytes are used to +   identify characters in the UCS. The relationship between UCS and +   UTF-8 is demonstrated in the following table: + +   UCS-4 range(hex)          UTF-8 byte sequence(binary) +   00000000 - 0000007F       0xxxxxxx +   00000080 - 000007FF       110xxxxx 10xxxxxx +   00000800 - 0000FFFF       1110xxxx 10xxxxxx 10xxxxxx +   00010000 - 001FFFFF       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx +   00200000 - 03FFFFFF       111110xx 10xxxxxx 10xxxxxx 10xxxxxx +                             10xxxxxx +   04000000 - 7FFFFFFF       1111110x 10xxxxxx 10xxxxxx 10xxxxxx +                             10xxxxxx 10xxxxxx + +   A beneficial property of UTF-8 is that its single byte sequence is +   consistent with the ASCII character set. This feature will allow a +   transition where old ASCII-only clients can still interoperate with +   new servers that support the UTF-8 encoding. 
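The range table in section 2.2 can be turned into code fairly directly. The following is a minimal sketch in Python, not part of the RFC text; the function name `ucs4_to_utf8` is hypothetical. It follows the 1 to 6 byte scheme exactly as tabulated above. Note that the later UTF-8 specification (RFC 3629) restricts sequences to at most 4 bytes, so the 5 and 6 byte forms are shown only to mirror the table.

```python
def ucs4_to_utf8(code_point: int) -> bytes:
    """Encode one UCS-4 value following the range table in section 2.2."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("UCS-4 value out of range")
    if code_point <= 0x7F:
        return bytes([code_point])  # single byte, identical to ASCII

    # (upper bound of the range, sequence length, lead-byte marker bits)
    ranges = [
        (0x7FF, 2, 0xC0),       # 110xxxxx
        (0xFFFF, 3, 0xE0),      # 1110xxxx
        (0x1FFFFF, 4, 0xF0),    # 11110xxx
        (0x3FFFFFF, 5, 0xF8),   # 111110xx
        (0x7FFFFFFF, 6, 0xFC),  # 1111110x
    ]
    for upper, length, marker in ranges:
        if code_point <= upper:
            break

    out = bytearray(length)
    # Fill the continuing bytes from the end: 10xxxxxx, six payload bits each.
    for i in range(length - 1, 0, -1):
        out[i] = 0x80 | (code_point & 0x3F)
        code_point >>= 6
    out[0] = marker | code_point  # remaining high bits go into the first byte
    return bytes(out)
```

For values that Python itself can encode, the result should agree with the built-in codec, for example `ucs4_to_utf8(0x20AC) == "\u20ac".encode("utf-8")`. The sketch does not reject surrogate values, since the table above does not mention them.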
+ + + + + +Curtin                     Proposed Standard                    [Page 4] + +RFC 2640                  FTP Internalization                  July 1999 + + +   Another feature is that the encoding rules make it very unlikely that +   a character sequence from a different character set will be mistaken +   for a UTF-8 encoded character sequence. Clients and servers can use a +   simple routine to determine if the character set being exchanged is +   valid UTF-8. Section B.1 shows a code example of this check. + +3 Pathnames + +3.1 General compliance + +   - The 7-bit restriction for pathnames exchanged is dropped. + +   - Many operating system allow the use of spaces <SP>, carriage return +     <CR>, and line feed <LF> characters as part of the pathname. The +     exchange of pathnames with these special command characters will +     cause the pathnames to be parsed improperly. This is because ftp +     commands associated with pathnames have the form: + +      COMMAND <SP> <pathname> <CRLF>. + +   To allow the exchange of pathnames containing these characters, the +   definition of pathname is changed from + +     <pathname> ::= <string>   ; in BNF format +   to +     pathname = 1*(%x01..%xFF) ; in ABNF format [ABNF]. + +   To avoid mistaking these characters within pathnames as special +   command characters the following rules will apply: + +   There MUST be only one <SP> between a ftp command and the pathname. +   Implementations MUST assume <SP> characters following the initial +   <SP> as part of the pathname. For example the pathname in STOR +   <SP><SP><SP>foo.bar<CRLF> is <SP><SP>foo.bar. + +   Current implementations, which may allow multiple <SP> characters as +   separators between the command and pathname, MUST assure that they +   comply with this single <SP> convention. Note: Implementations which +   treat 3 character commands (e.g. CWD, MKD, etc.) 
as a fixed 4 +   character command by padding the command with a trailing <SP> are in +   non-compliance to this specification. + +   When a <CR> character is encountered as part of a pathname it MUST be +   padded with a <NUL> character prior to sending the command. On +   receipt of a pathname containing a <CR><NUL> sequence the <NUL> +   character MUST be stripped away. This approach is described in the +   Telnet protocol [RFC854] on pages 11 and 12. For example, to store a +   pathname foo<CR><LF>boo.bar the pathname would become + + + +Curtin                     Proposed Standard                    [Page 5] + +RFC 2640                  FTP Internalization                  July 1999 + + +   foo<CR><NUL><LF>boo.bar prior to sending the command STOR +   <SP>foo<CR><NUL><LF>boo.bar<CRLF>. Upon receipt of the altered +   pathname the <NUL> character following the <CR> would be stripped +   away to form the original pathname. + +   - Conforming clients and servers MUST support UTF-8 for the transfer +     and receipt of pathnames. Clients and servers MAY in addition give +     users a choice of specifying interpretation of pathnames in another +     encoding. Note that configuring clients and servers to use +     character sets / encoding other than UTF-8 is outside of the scope +     of this document. While it is recognized that in certain +     operational scenarios this may be desirable, this is left as a +     quality of implementation and operational issue. + +   - Pathnames are sequences of bytes.  The encoding of names that are +     valid UTF-8 sequences is assumed to be UTF-8.  The character set of +     other names is undefined. Clients and servers, unless otherwise +     configured to support a specific native character set, MUST check +     for a valid UTF-8 byte sequence to determine if the pathname being +     presented is UTF-8. 
+ +   - To avoid data loss, clients and servers SHOULD use the UTF-8 +     encoded pathnames when unable to convert them to a usable code set. + +   - There may be cases when the code set / encoding presented to the +     server or client cannot be determined. In such cases the raw bytes +     SHOULD be used. + +3.2 Servers compliance + +   - Servers MUST support the UTF-8 feature in response to the FEAT +     command [RFC2389]. The UTF-8 feature is a line containing the exact +     string "UTF8". This string is not case sensitive, but SHOULD be +     transmitted in upper case. The response to a FEAT command SHOULD +     be: + +        C> feat +        S> 211- <any descriptive text> +        S>  ... +        S>  UTF8 +        S>  ... +        S> 211 end + +   The ellipses indicate placeholders where other features may be +   included, but are NOT REQUIRED. The one space indentation of the +   feature lines is mandatory [RFC2389]. + + + + + +Curtin                     Proposed Standard                    [Page 6] + +RFC 2640                  FTP Internalization                  July 1999 + + +   - Mirror servers may want to exactly reflect the site that they are +     mirroring. In such cases servers MAY store and present the exact +     pathname bytes that it received from the main server. + +3.3 Clients compliance + +   - Clients which do not require display of pathnames are under no +     obligation to do so. Non-display clients do not need to conform to +     requirements associated with display. + +   - Clients, which are presented UTF-8 pathnames by the server, SHOULD +     parse UTF-8 correctly and attempt to display the pathname within +     the limitation of the resources available. + +   - Clients MUST support the FEAT command and recognize the "UTF8" +     feature (defined in 3.2 above) to determine if a server supports +     UTF-8 encoding. + +   - Character semantics of other names shall remain undefined. 
If a +     client detects that a server is non UTF-8, it SHOULD change its +     display appropriately. How a client implementation handles non +     UTF-8 is a quality of implementation issue. It MAY try to assume +     some other encoding, give the user a chance to try to assume +     something, or save encoding assumptions for a server from one FTP +     session to another. + +   - Glyph rendering is outside the scope of this document. How a client +     presents characters it cannot display is a quality of +     implementation issue. This document RECOMMENDS that octets +     corresponding to non-displayable characters SHOULD be presented in +     URL %HH format defined in RFC 1738 [RFC1738]. They MAY, however, +     display them as question marks, with their UCS hexadecimal value, +     or in any other suitable fashion. + +   - Many existing clients interpret 8-bit pathnames as being in the +     local character set. They MAY continue to do so for pathnames that +     are not valid UTF-8. + +4. Language Support + +   The Character Set Workshop Report [RFC2130] suggests that clients and +   servers SHOULD negotiate a language for "greetings" and "error +   messages". This specification interprets the use of the term  "error +   message", by RFC 2130, to mean any explanatory text string returned +   by server-PI in response to a user-PI command. + + + + + + +Curtin                     Proposed Standard                    [Page 7] + +RFC 2640                  FTP Internalization                  July 1999 + + +   Implementers SHOULD note that FTP commands and numeric responses are +   protocol elements. As such, their use is not affected by any guidance +   expressed by this specification. + +   Language support of greetings and command responses shall be the +   default language supported by the server or the language supported by +   the server and selected by the client. 
+ +   It may be possible to achieve language support through a virtual host +   as described in [MLST]. However, an FTP server might not support +   virtual servers, or virtual servers might be configured to support an +   environment without regard for language. To allow language +   negotiation this specification defines a new LANG command. Clients +   and servers that comply with this specification MUST support the LANG +   command. + +4.1 The LANG command + +   A new command "LANG" is added to the FTP command set to allow +   server-FTP process to determine in which language to present server +   greetings and the textual part of command responses. The parameter +   associated with the LANG command SHALL be one of the language tags +   defined in RFC 1766 [RFC1766]. If a LANG command without a parameter +   is issued the server's default language will be used. + +   Greetings and responses issued prior to language negotiation SHALL be +   in the server's default language. Paragraph 4.5 of [RFC2277] state +   that this "default language MUST be understandable by an English- +   speaking person". This specification RECOMMENDS that the server +   default language be English encoded using ASCII. This text may be +   augmented by text from other languages. Once negotiated, server-PI +   MUST return server messages and textual part of command responses in +   the negotiated language and encoded in UTF-8. Server-PI MAY wish to +   re-send previously issued server messages in the newly negotiated +   language. + +   The LANG command only affects presentation of greeting messages and +   explanatory text associated with command responses. No attempt should +   be made by the server to translate protocol elements (FTP commands +   and numeric responses) or data transmitted over the data connection. + +   User-PI MAY issue the LANG command at any time during an FTP session. 
+   In order to gain the full benefit of this command, it SHOULD be +   presented prior to authentication. In general, it will be issued +   after the HOST command [MLST]. Note that the issuance of a HOST or + + + + + + +Curtin                     Proposed Standard                    [Page 8] + +RFC 2640                  FTP Internalization                  July 1999 + + +   REIN command [RFC959] will negate the affect of the LANG command. +   User-PI SHOULD be capable of supporting UTF-8 encoding for the +   language negotiated. Guidance on interpretation and rendering of +   UTF-8, defined in section 3, SHALL apply. + +   Although NOT REQUIRED by this specification, a user-PI SHOULD issue a +   FEAT command [RFC2389] prior to a LANG command. This will allow the +   user-PI to determine if the server supports the LANG command and +   which language options. + +   In order to aid the server in identifying whether a connection has +   been established with a client which conforms to this specification +   or an older client, user-PI MUST send a HOST [MLST] and/or LANG +   command prior to issuing any other command (other than FEAT +   [RFC2389]). If user-PI issues a HOST command, and the server's +   default language is acceptable, it need not issue a LANG command. +   However, if the implementation does not support the HOST command, a +   LANG command MUST be issued. Until server-PI is presented with either +   a HOST or LANG command it SHOULD assume that the user-PI does not +   comply with this specification. 
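The ordering rules in section 4.1 can be summarized as a small client-side sketch. This is an illustration only, not part of the RFC text: the function name `opening_commands` is hypothetical, and the `HOST <name>` argument form is an assumption, since the HOST command is defined in [MLST] rather than in this document.

```python
def opening_commands(lang_tag=None, host=None):
    """Return the command lines a client might send before authenticating.

    lang_tag: an RFC 1766 language tag such as "fr", or None to accept
              the server's default language.
    host:     a virtual host name, or None if the client does not
              implement HOST (argument form assumed, see [MLST]).
    """
    # FEAT is not required, but it lets the client learn whether LANG
    # is supported and which languages are offered.
    commands = ["FEAT"]
    if host is not None:
        # LANG is generally issued after HOST.
        commands.append(f"HOST {host}")
    if lang_tag is not None or host is None:
        # Without HOST, a LANG command is mandatory; a bare LANG
        # selects the server's default language.
        commands.append("LANG" if lang_tag is None else f"LANG {lang_tag}")
    return [command + "\r\n" for command in commands]
```

For example, `opening_commands("fr")` yields FEAT followed by `LANG fr`, while `opening_commands(host="ftp.example.com")` yields FEAT followed by HOST only, which matches the case where the server's default language is acceptable.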
+ +4.2 Syntax of the LANG command + +   The LANG command is defined as follows: + +   lang-command       = "Lang" [(SP lang-tag)] CRLF +   lang-tag           = Primary-tag *( "-" Sub-tag) +   Primary-tag        = 1*8ALPHA +   Sub-tag            = 1*8ALPHA + +   lang-response      = lang-ok / error-response +   lang-ok            = "200" [SP *(%x00..%xFF) ] CRLF +   error-response     = command-unrecognized / bad-argument / +                     not-implemented / unsupported-parameter +   command-unrecognized  = "500" [SP *(%x01..%xFF) ] CRLF +   bad-argument       = "501" [SP *(%x01..%xFF) ] CRLF +   not-implemented    = "502" [SP *(%x01..%xFF) ] CRLF +   unsupported-parameter = "504" [SP *(%x01..%xFF) ] CRLF + +   The "lang" command word is case independent and may be specified in +   any character case desired. Therefore "LANG", "lang", "Lang", and +   "lAnG" are equivalent commands. + +   The OPTIONAL "Lang-tag" given as a parameter specifies the primary +   language tags and zero or more sub-tags as defined in [RFC1766]. As +   described in [RFC1766] language tags are treated as case insensitive. +   If omitted server-PI MUST use the server's default language. + + + + +Curtin                     Proposed Standard                    [Page 9] + +RFC 2640                  FTP Internalization                  July 1999 + + +   Server-FTP responds to the "Lang" command with either "lang-ok" or +   "error-response". "lang-ok" MUST be sent if Server-FTP supports the +   "Lang" command and can support some form of the "lang-tag". Support +   SHOULD be as follows: + +   - If server-FTP receives "Lang" with no parameters it SHOULD return +     messages and command responses in the server default language. + +   - If server-FTP receives "Lang" with only a primary tag argument +     (e.g. en, fr, de, ja, zh, etc.), which it can support, it SHOULD +     return messages and command responses in the language associated +     with that primary tag. 
It is possible that server-FTP will only +     support the primary tag when combined with a sub-tag (e.g. en-US, +     en-UK, etc.). In such cases, server-FTP MAY determine the +     appropriate variant to use during the session. How server-FTP makes +     that determination is outside the scope of this specification. If +     server-FTP cannot determine if a sub-tag variant is appropriate it +     SHOULD return an "unsupported-parameter" (504) response. + +   - If server-FTP receives "Lang" with a primary tag and sub-tag(s) +     argument, which is implemented, it SHOULD return messages and +     command responses in support of the language argument. It is +     possible that server-FTP can support the primary tag of the "Lang" +     argument but not the sub-tag(s). In such cases server-FTP MAY +     return messages and command responses in the most appropriate +     variant of the primary tag that has been implemented. How server- +     FTP makes that determination is outside the scope of this +     specification. If server-FTP cannot determine if a sub-tag variant +     is appropriate it SHOULD return an "unsupported-parameter" (504) +     response. + +   For example if client-FTP sends a "LANG en-AU" command and server-FTP +   has implemented language tags en-US and en-UK it may decide that the +   most appropriate language tag is en-UK and return "200 en-AU not +   supported. Language set to en-UK". The numeric response is a protocol +   element and can not be changed. The associated string is for +   illustrative purposes only. + +   Clients and servers that conform to this specification MUST support +   the LANG command. Clients SHOULD, however, anticipate receiving a 500 +   or 502 command response, in cases where older or non-compliant +   servers do not recognize or have not implemented the "Lang". A 501 +   response SHOULD be sent if the argument to the "Lang" command is not +   syntactically correct. 
A 504 response SHOULD be sent if the "Lang" +   argument, while syntactically correct, is not implemented. As noted +   above, an argument may be considered a lexicon match even though it +   is not an exact syntax match. + + + + +Curtin                     Proposed Standard                   [Page 10] + +RFC 2640                  FTP Internalization                  July 1999 + + +4.3 Feat response for LANG command + +   A server-FTP process that supports the LANG command, and language +   support for messages and command responses, MUST include in the +   response to the FEAT command [RFC2389], a feature line indicating +   that the LANG command is supported and a fact list of the supported +   language tags. A response to a FEAT command SHALL be in the following +   format: + +        Lang-feat  = SP "LANG" SP lang-fact CRLF +        lang-fact  = lang-tag ["*"] *(";" lang-tag ["*"]) + +        lang-tag   = Primary-tag *( "-" Sub-tag) +        Primary-tag= 1*8ALPHA +        Sub-tag    = 1*8ALPHA + +   The lang-feat response contains the string "LANG" followed by a +   language fact. This string is not case sensitive, but SHOULD be +   transmitted in upper case, as recommended in [RFC2389]. The initial +   space shown in the Lang-feat response is REQUIRED by the FEAT +   command. It MUST be a single space character. More or less space +   characters are not permitted. The lang-fact SHALL include the lang- +   tags which server-FTP can support. At least one lang-tag MUST be +   included with the FEAT response. The lang-tag SHALL be in the form +   described earlier in this document. The OPTIONAL asterisk, when +   present, SHALL indicate the current lang-tag being used by server-FTP +   for messages and responses. + +4.3.1 Feat examples + +        C> feat +        S> 211- <any descriptive text> +        S>  ... +        S>  LANG EN* +        S>  ... 
+        S> 211 end + +   In this example server-FTP can only support English, which is the +   current language (as shown by the asterisk) being used by the server +   for messages and command responses. + +        C> feat +        S> 211- <any descriptive text> +        S>  ... +        S>  LANG EN*;FR +        S>  ... +        S> 211 end + + + + +Curtin                     Proposed Standard                   [Page 11] + +RFC 2640                  FTP Internalization                  July 1999 + + +        C> LANG fr +        S> 200 Le response sera changez au francais + +        C> feat +        S> 211- <quelconque descriptif texte> +        S>  ... +        S>  LANG EN;FR* +        S>  ... +        S> 211 end + +   In this example server-FTP supports both English and French as shown +   by the initial response to the FEAT command. The asterisk indicates +   that English is the current language in use by server-FTP. After a +   LANG command is issued to change the language to French, the FEAT +   response shows French as the current language in use. + +   In the above examples ellipses indicate placeholders where other +   features may be included, but are NOT REQUIRED. + +5 Security Considerations + +   This document addresses the support of character sets beyond 1 byte +   and a new language negotiation command. Conformance to this document +   should not induce a security risk. + +6 Acknowledgments + +   The following people have contributed to this document: + +   D. J. Bernstein +   Martin J. 
Duerst +   Mark Harris +   Paul Hethmon +   Alun Jones +   Gregory Lundberg +   James Matthews +   Keith Moore +   Sandra O'Donnell +   Benjamin Riefenstahl +   Stephen Tihor + +   (and others from the FTPEXT working group) + + + + + + + + + +Curtin                     Proposed Standard                   [Page 12] + +RFC 2640                  FTP Internalization                  July 1999 + + +7 Glossary + +   BIDI - abbreviation for Bi-directional, a reference to mixed right- +   to-left and left-to-right text. + +   Character Set - a collection of characters used to represent textual +   information in which each character has a numeric value + +   Code Set -  (see character set). + +   Glyph - a character image represented on a display device. + +   I18N - "I eighteen N", the first and last letters of the word +   "internationalization" and the eighteen letters in between. + +   UCS-2 - the ISO/IEC 10646 two octet Universal Character Set form. + +   UCS-4 - the ISO/IEC 10646 four octet Universal Character Set form. + +   UTF-8 - the UCS Transformation Format represented in 8 bits. + +   TF-16 - A 16-bit format including the BMP (directly encoded) and +   surrogate pairs to represent characters in planes 01-16; equivalent +   to Unicode. + +8 Bibliography + +   [ABNF]       Crocker, D. and P. Overell, "Augmented BNF for Syntax +                Specifications: ABNF", RFC 2234, November 1997. + +   [ASCII]      ANSI X3.4:1986 Coded Character Sets - 7 Bit American +                National Standard Code for Information Interchange (7- +                bit ASCII) + +   [ISO-8859]   ISO 8859.  International standard -- Information +                processing -- 8-bit single-byte coded graphic character +                sets -- Part 1:Latin alphabet No. 1 (1987) -- Part 2: +                Latin alphabet No. 2 (1987) -- Part 3: Latin alphabet +                No. 3 (1988) -- Part 4: Latin alphabet No. 
4 (1988) -- +                Part 5: Latin/Cyrillic alphabet (1988) -- Part 6: +                Latin/Arabic alphabet (1987) -- Part : Latin/Greek +                alphabet (1987) -- Part 8: Latin/Hebrew alphabet (1988) +                -- Part 9: Latin alphabet No. 5 (1989) -- Part10: Latin +                alphabet No. 6 (1992) + +   [BCP14]      Bradner, S., "Key words for use in RFCs to Indicate +                Requirement Levels", BCP 14, RFC 2119, March 1997. + + + + +Curtin                     Proposed Standard                   [Page 13] + +RFC 2640                  FTP Internalization                  July 1999 + + +   [ISO-10646]  ISO/IEC 10646-1:1993. International standard -- +                Information technology -- Universal multiple-octet coded +                character set (UCS) -- Part 1: Architecture and basic +                multilingual plane. + +   [MLST]       Elz, R. and P. Hethmon, "Extensions to FTP", Work in +                Progress. + +   [RFC854]     Postel, J. and J. Reynolds, "Telnet Protocol +                Specification", STD 8, RFC 854, May 1983. + +   [RFC959]     Postel, J. and J. Reynolds, "File Transfer Protocol +                (FTP)", STD 9, RFC 959, October 1985. + +   [RFC1123]    Braden, R., "Requirements for Internet Hosts -- +                Application and Support", STD 3, RFC 1123, October 1989. + +   [RFC1738]    Berners-Lee, T., Masinter, L. and M. McCahill, "Uniform +                Resource Locators (URL)", RFC 1738, December 1994. + +   [RFC1766]    Alvestrand, H., "Tags for the Identification of +                Languages", RFC 1766, March 1995. + +   [RFC2130]    Weider, C., Preston, C., Simonsen, K., Alvestrand, H., +                Atkinson, R., Crispin, M. and P. Svanberg, "Character +                Set Workshop Report", RFC 2130, April 1997. + +   [RFC2277]    Alvestrand, H., " IETF Policy on Character Sets and +                Languages", RFC 2277, January 1998. 

   [RFC2279]    Yergeau, F., "UTF-8, a transformation format of ISO
                10646", RFC 2279, January 1998.

   [RFC2389]    Elz, R. and P. Hethmon, "Feature Negotiation Mechanism
                for the File Transfer Protocol", RFC 2389, August 1998.

   [UNICODE]    The Unicode Consortium, "The Unicode Standard - Version
                2.0", Addison-Wesley Developers Press, July 1996.

   [UTF-8]      ISO/IEC 10646-1:1993 AMENDMENT 2 (1996). UCS
                Transformation Format 8 (UTF-8).

9 Author's Address

   Bill Curtin
   JIEO
   Attn: JEBBD
   Ft. Monmouth, N.J. 07703-5613

   EMail: curtinw@ftm.disa.mil

Annex A - Implementation Considerations

A.1 General Considerations

   - Implementers should ensure that their code accounts for potential
     problems, such as using a NULL character to terminate a string or
     no longer being able to steal the high order bit for internal use,
     when supporting the extended character set.

   - Implementers should be aware that there is a chance that pathnames
     that are non UTF-8 may be parsed as valid UTF-8. The probability
     is low for some encodings and statistically close to zero for
     others. A recent non-scientific analysis found that EUC encoded
     Japanese words had a 2.7% false reading rate; SJIS had a 0.0005%
     false reading rate; other encodings such as ASCII or KOI-8 had a
     0% false reading rate. This probability is highest for short
     pathnames and decreases as pathname size increases.
     Implementers may want to look for signs that pathnames which parse
     as UTF-8 are not valid UTF-8, such as the existence of multiple
     local character sets in short pathnames. Hopefully, as more
     implementations conform to UTF-8 transfer encoding there will be
     less need to guess at the encoding.

   - Client developers should be aware that it will be possible for
     pathnames to contain mixed characters (e.g.
     //Latin1DirectoryName/HebrewFileName). They should be prepared to
     handle the Bi-directional (BIDI) display of these character sets
     (i.e. left to right display for the directory and right to left
     display for the filename). While bi-directional display is outside
     the scope of this document and more complicated than the above
     example, an algorithm for bi-directional display can be found in
     the UNICODE 2.0 [UNICODE] standard. Note also that pathnames can
     have different byte ordering yet be logically and display-wise
     equivalent due to the insertion of BIDI control characters at
     different points during composition. Mixed character sets may
     also present problems with font swapping.

   - A server that copies pathnames transparently from a local
     filesystem may continue to do so. It is then up to the local file
     creators to use UTF-8 pathnames.

   - Servers can support charset labeling of files and/or directories,
     such that different pathnames may have different charsets. The
     server should attempt to convert all pathnames to UTF-8, but if it
     can't then it should leave that name in its raw form.

   - Some servers' operating systems do not mandate a character set,
     but allow administrators to configure one in the FTP server.
     These servers should be configured to use a particular mapping
     table (either external or built-in). This will allow the
     flexibility of defining different charsets for different
     directories.

   - If the server's OS does not mandate the character set and the FTP
     server cannot be configured, the server should simply use the raw
     bytes in the file name.  They might be ASCII or UTF-8.

   - If the server is a mirror, and wants to look just like the site it
     is mirroring, it should store the exact file name bytes that it
     received from the main server.

A.2 Transition Considerations

   - Servers which support this specification, when presented with a
     pathname from an old client (one which does not support this
     specification), can nearly always tell whether the pathname is in
     UTF-8 (see B.1) or in some other code set. In order to support
     these older clients, servers may wish to default to a non UTF-8
     code set. However, how a server supports non UTF-8 is outside the
     scope of this specification.

   - Clients which support this specification will be able to determine
     if the server can support UTF-8 (i.e. supports this specification)
     by the ability of the server to support the FEAT command and the
     UTF8 feature (defined in 3.2). If a newer client determines that
     the server does not support UTF-8 it may wish to default to a
     different code set. Client developers should take into
     consideration that pathnames, associated with older servers, might
     be stored in UTF-8.
     However, how a client supports non UTF-8 is outside the scope of
     this specification.

   - Clients and servers can transition to UTF-8 by either converting
     to/from the local encoding, or the users can store UTF-8 filenames.
     The former approach is easier on tightly controlled file systems
     (e.g. PCs and Macs). The latter approach is easier on more free
     form file systems (e.g. Unix).

   - For interactive use attention should be focused on user interface
     and ease of use. Non-interactive use requires a consistent and
     controlled behavior.

   - There may be many applications which reference files under their
     old raw pathname (e.g. linked URLs). Changing the pathname to UTF-8
     will cause access to the old URL to fail. A solution may be for the
     server to act as if there were two different pathnames associated
     with the file. This might be done internal to the server on
     controlled file systems or by using symbolic links on free form
     systems. While this approach may work for single file transfer
     non-interactive use, a non-interactive transfer of all of the files
     in a directory will produce duplicates. Interactive users may be
     presented with lists of files which are double the actual number
     of files.

Annex B - Sample Code and Examples

B.1 Valid UTF-8 check

   The following routine checks if a byte sequence is valid UTF-8. This
   is done by checking for the proper tagging of the first and following
   bytes to make sure they conform to the UTF-8 format. It then checks
   to ensure that the data part of the UTF-8 sequence conforms to the
   proper range allowed by the encoding. Note: This routine will not
   detect characters that have not been assigned and therefore do not
   exist.

int utf8_valid(const unsigned char *buf, unsigned int len)
{
 const unsigned char *endbuf = buf + len;
 unsigned char byte2mask=0x00, c;
 int trailing = 0;  // trailing (continuation) bytes to follow

 while (buf != endbuf)
 {
   c = *buf++;
   if (trailing)
    if ((c&0xC0) == 0x80)  // Does trailing byte follow UTF-8 format?
    {if (byte2mask)        // Need to check 2nd byte for proper range?
      if (c&byte2mask)     // Are appropriate bits set?
       byte2mask=0x00;
      else
       return 0;
     trailing--; }
    else
     return 0;
   else
    if ((c&0x80) == 0x00)  continue;      // valid 1 byte UTF-8
    else if ((c&0xE0) == 0xC0)            // valid 2 byte UTF-8
          if (c&0x1E)                     // Is UTF-8 byte in
                                          // proper range?
           trailing =1;
          else
           return 0;
    else if ((c&0xF0) == 0xE0)           // valid 3 byte UTF-8
          {if (!(c&0x0F))                // Is UTF-8 byte in
                                         // proper range?
            byte2mask=0x20;              // If not set mask
                                         // to check next byte
            trailing = 2;}
    else if ((c&0xF8) == 0xF0)           // valid 4 byte UTF-8
          {if (!(c&0x07))                // Is UTF-8 byte in
                                         // proper range?
            byte2mask=0x30;              // If not set mask
                                         // to check next byte
            trailing = 3;}
    else if ((c&0xFC) == 0xF8)           // valid 5 byte UTF-8
          {if (!(c&0x03))                // Is UTF-8 byte in
                                         // proper range?
            byte2mask=0x38;              // If not set mask
                                         // to check next byte
            trailing = 4;}
    else if ((c&0xFE) == 0xFC)           // valid 6 byte UTF-8
          {if (!(c&0x01))                // Is UTF-8 byte in
                                         // proper range?
            byte2mask=0x3C;              // If not set mask
                                         // to check next byte
            trailing = 5;}
    else  return 0;
 }
  return trailing == 0;
}

B.2 Conversions

   The code examples in this section closely reflect the algorithm in
   ISO 10646 and may not present the most efficient solution for
   converting to / from UTF-8 encoding. If efficiency is an issue,
   implementers should use the appropriate bitwise operators.

   Additional code examples and numerous mapping tables can be found at
   the Unicode site, HTTP://www.unicode.org or FTP://unicode.org.

   Note that the conversion examples below assume that the local
   character set supported in the operating system is something other
   than UCS2/UTF-16. There are some operating systems that already
   support UCS2/UTF-16 (notably Plan 9 and Windows NT). In this case no
   conversion will be necessary from the local character set to the UCS.

B.2.1 Conversion from Local Character Set to UTF-8

   Conversion from the local filesystem character set to UTF-8 will
   normally involve a two step process. First convert the local
   character set to the UCS; then convert the UCS to UTF-8.

   The first step in the process can be performed by maintaining a
   mapping table that includes the local character set code and the
   corresponding UCS code. For instance the ISO/IEC 8859-8 [ISO-8859]
   code for the Hebrew letter "VAV" is 0xE5. The corresponding 4 byte
   ISO/IEC 10646 code is 0x000005D5.
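
   As a rough sketch of the first step, a mapping table can be as small
   as an array of (local code, UCS code) pairs searched linearly. The
   names below (local_to_ucs_entry, iso8859_8_sample, local_to_ucs4)
   are hypothetical and not part of this specification. The four sample
   entries are an assumption taken from the ISO/IEC 8859-8 code chart,
   in which the Hebrew letters run from 0xE0 (ALEF) through 0xFA (TAV);
   they should be checked against the standard or a published mapping
   file before being relied upon.

```c
#include <stddef.h>

/* Hypothetical table entry: one local character code and its UCS code. */
struct local_to_ucs_entry {
    unsigned char local_code;   /* code in the local 8-bit character set */
    unsigned long ucs4_code;    /* corresponding ISO/IEC 10646 code */
};

/* A few sample ISO/IEC 8859-8 entries (assumed values, see the text). */
static const struct local_to_ucs_entry iso8859_8_sample[] = {
    { 0xE0, 0x000005D0UL },  /* HEBREW LETTER ALEF */
    { 0xE4, 0x000005D4UL },  /* HEBREW LETTER HE */
    { 0xE5, 0x000005D5UL },  /* HEBREW LETTER VAV */
    { 0xFA, 0x000005EAUL },  /* HEBREW LETTER TAV */
};

/* Returns 1 and stores the UCS code in *ucs4 if local_code can be
   mapped; returns 0 if the table has no entry for it. Codes below
   0x80 pass through unchanged because ISO/IEC 8859-8 shares them
   with ASCII. */
int local_to_ucs4(unsigned char local_code, unsigned long *ucs4)
{
    size_t i;
    size_t n = sizeof iso8859_8_sample / sizeof iso8859_8_sample[0];

    if (local_code < 0x80) {
        *ucs4 = local_code;
        return 1;
    }
    for (i = 0; i < n; i++) {
        if (iso8859_8_sample[i].local_code == local_code) {
            *ucs4 = iso8859_8_sample[i].ucs4_code;
            return 1;
        }
    }
    return 0;
}
```

   A production table would cover the whole character set and would
   normally be generated from a published mapping file rather than
   typed by hand. The reverse direction described in B.2.2 can reuse
   the same table, searched on the UCS code instead.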

   The next step is to convert the UCS character code to the UTF-8
   encoding. The following routine can be used to determine and encode
   the correct number of bytes based on the UCS-4 character code:

   unsigned int ucs4_to_utf8 (unsigned long *ucs4_buf, unsigned int
                              ucs4_len, unsigned char *utf8_buf)

   {
    const unsigned long *ucs4_endbuf = ucs4_buf + ucs4_len;
    unsigned int utf8_len = 0;        // return value for UTF8 size
    unsigned char *t_utf8_buf = utf8_buf; // Temporary pointer
                                          // to load UTF8 values

    while (ucs4_buf != ucs4_endbuf)
    {
     if ( *ucs4_buf <= 0x7F)    // ASCII chars no conversion needed
     {
      *t_utf8_buf++ = (unsigned char) *ucs4_buf;
      utf8_len++;
      ucs4_buf++;
     }
     else
      if ( *ucs4_buf <= 0x07FF ) // In the 2 byte utf-8 range
      {
        *t_utf8_buf++= (unsigned char) (0xC0 + (*ucs4_buf/0x40));
        *t_utf8_buf++= (unsigned char) (0x80 + (*ucs4_buf%0x40));
        utf8_len+=2;
        ucs4_buf++;
      }
      else
        if ( *ucs4_buf <= 0xFFFF ) /* In the 3 byte utf-8 range.
                                    The values 0x0000FFFE, 0x0000FFFF
                                    and 0x0000D800 - 0x0000DFFF do
                                    not occur in UCS-4 */
        {
         *t_utf8_buf++= (unsigned char) (0xE0 +
                        (*ucs4_buf/0x1000));
         *t_utf8_buf++= (unsigned char) (0x80 +
                        ((*ucs4_buf/0x40)%0x40));
         *t_utf8_buf++= (unsigned char) (0x80 + (*ucs4_buf%0x40));
         utf8_len+=3;
         ucs4_buf++;
         }
        else
         if ( *ucs4_buf <= 0x1FFFFF ) //In the 4 byte utf-8 range
         {
          *t_utf8_buf++= (unsigned char) (0xF0 +
                         (*ucs4_buf/0x040000));
          *t_utf8_buf++= (unsigned char) (0x80 +
                         ((*ucs4_buf/0x1000)%0x40));
          *t_utf8_buf++= (unsigned char) (0x80 +
                         ((*ucs4_buf/0x40)%0x40));
          *t_utf8_buf++= (unsigned char) (0x80 + (*ucs4_buf%0x40));
          utf8_len+=4;
          ucs4_buf++;

         }
         else
          if ( *ucs4_buf <= 0x03FFFFFF )//In the 5 byte utf-8 range
          {
           *t_utf8_buf++= (unsigned char) (0xF8 +
                          (*ucs4_buf/0x01000000));
           *t_utf8_buf++= (unsigned char) (0x80 +
                          ((*ucs4_buf/0x040000)%0x40));
           *t_utf8_buf++= (unsigned char) (0x80 +
                          ((*ucs4_buf/0x1000)%0x40));
           *t_utf8_buf++= (unsigned char) (0x80 +
                          ((*ucs4_buf/0x40)%0x40));
           *t_utf8_buf++= (unsigned char) (0x80 +
                          (*ucs4_buf%0x40));
           utf8_len+=5;
           ucs4_buf++;
          }
          else
          if ( *ucs4_buf <= 0x7FFFFFFF )//In the 6 byte utf-8 range
           {
             *t_utf8_buf++= (unsigned char)
                            (0xFC +(*ucs4_buf/0x40000000));
             *t_utf8_buf++= (unsigned char) (0x80 +
                            ((*ucs4_buf/0x01000000)%0x40));
             *t_utf8_buf++= (unsigned char) (0x80 +
                            ((*ucs4_buf/0x040000)%0x40));
             *t_utf8_buf++= (unsigned char) (0x80 +
                            ((*ucs4_buf/0x1000)%0x40));
             *t_utf8_buf++= (unsigned char) (0x80 +
                            ((*ucs4_buf/0x40)%0x40));
             *t_utf8_buf++= (unsigned char) (0x80 +
                            (*ucs4_buf%0x40));
             utf8_len+=6;
             ucs4_buf++;

           }
           else
            break;   /* above 0x7FFFFFFF: not a UCS-4 code; stop
                        rather than loop forever */
    }
    return (utf8_len);
   }

B.2.2 Conversion from UTF-8 to Local Character Set

   When moving from UTF-8 encoding to the local character set the
   reverse procedure is used. First the UTF-8 encoding is transformed
   into the UCS-4 character set. The UCS-4 is then converted to the
   local character set from a mapping table (i.e. the opposite of the
   table used to form the UCS-4 character code).

   To convert from UTF-8 to UCS-4 the free bits (those that do not
   define UTF-8 sequence size or signify continuation bytes) in a UTF-8
   sequence are concatenated as a bit string. The bits are then
   distributed into a four-byte sequence starting from the least
   significant bits. Those bits not assigned a bit in the four-byte
   sequence are padded with ZERO bits.
   The following routine converts the UTF-8 encoding to UCS-4
   character codes:

   int utf8_to_ucs4 (unsigned long *ucs4_buf, unsigned int utf8_len,
                     unsigned char *utf8_buf)
   {

   const unsigned char *utf8_endbuf = utf8_buf + utf8_len;
   unsigned int ucs_len=0;

    while (utf8_buf != utf8_endbuf)
    {

     if ((*utf8_buf & 0x80) == 0x00)  /*ASCII chars no conversion
                                        needed */
     {
      *ucs4_buf++ = (unsigned long) *utf8_buf;
      utf8_buf++;
      ucs_len++;
     }
     else
      if ((*utf8_buf & 0xE0)== 0xC0) //In the 2 byte utf-8 range
      {
        *ucs4_buf++ = (unsigned long) (((*utf8_buf - 0xC0) * 0x40)
                       + ( *(utf8_buf+1) - 0x80));
        utf8_buf += 2;
        ucs_len++;
      }
      else
        if ( (*utf8_buf & 0xF0) == 0xE0 ) /*In the 3 byte utf-8
                                            range */
        {
        *ucs4_buf++ = (unsigned long) (((*utf8_buf - 0xE0) * 0x1000)
                      + (( *(utf8_buf+1) -  0x80) * 0x40)
                      + ( *(utf8_buf+2) - 0x80));
         utf8_buf+=3;
         ucs_len++;
        }
        else
         if ((*utf8_buf & 0xF8) == 0xF0) /* In the 4 byte utf-8
                                            range */
         {
          *ucs4_buf++ = (unsigned long)
                          (((*utf8_buf - 0xF0) * 0x040000)
                          + (( *(utf8_buf+1) -  0x80) * 0x1000)
                          + (( *(utf8_buf+2) -  0x80) * 0x40)
                          + ( *(utf8_buf+3) - 0x80));
          utf8_buf+=4;
          ucs_len++;
         }
         else
          if ((*utf8_buf & 0xFC) == 0xF8) /* In the 5 byte utf-8
                                             range */
          {
           *ucs4_buf++ = (unsigned long)
                          (((*utf8_buf - 0xF8) * 0x01000000)
                          + ((*(utf8_buf+1) - 0x80) * 0x040000)
                          + (( *(utf8_buf+2) -  0x80) * 0x1000)
                          + (( *(utf8_buf+3) -  0x80) * 0x40)
                          + ( *(utf8_buf+4) - 0x80));
           utf8_buf+=5;
           ucs_len++;
          }
          else
           if ((*utf8_buf & 0xFE) == 0xFC) /* In the 6 byte utf-8
                                              range */
           {
             *ucs4_buf++ = (unsigned long)
                           (((*utf8_buf - 0xFC) * 0x40000000)
                            + ((*(utf8_buf+1) - 0x80) * 0x01000000)
                            + ((*(utf8_buf+2) - 0x80) * 0x040000)
                            + (( *(utf8_buf+3) -  0x80) * 0x1000)
                            + (( *(utf8_buf+4) -  0x80) * 0x40)
                            + ( *(utf8_buf+5) - 0x80));
             utf8_buf+=6;
             ucs_len++;
           }
           else
            break;   /* not a UTF-8 lead byte; stop rather than
                        loop forever on invalid input */

    }
   return (ucs_len);
   }

B.2.3 ISO/IEC 8859-8 Example

   This example demonstrates mapping the ISO/IEC 8859-8 character set
   to UTF-8 and back to ISO/IEC 8859-8. As noted earlier, the Hebrew
   letter "VAV" is converted from the ISO/IEC 8859-8 character code
   0xE5 to the corresponding 4 byte ISO/IEC 10646 code of 0x000005D5 by
   a simple lookup of a conversion/mapping file.

   The UCS-4 character code is transformed into UTF-8 using the
   ucs4_to_utf8 routine described earlier by:

   1. Because the UCS-4 character is between 0x80 and 0x07FF it will map
      to a 2 byte UTF-8 sequence.
   2. The first byte is defined by (0xC0 + (0x000005D5 / 0x40)) = 0xD7.
   3. The second byte is defined by (0x80 + (0x000005D5 % 0x40)) = 0x95.
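
   The arithmetic in steps 1 through 3 can be checked mechanically. The
   helper below is a minimal sketch that handles only the 2 byte case
   (UCS codes 0x80 through 0x07FF). Its name, encode_2byte, is made up
   for this illustration; it mirrors the division and remainder form
   used in the steps above and is not a replacement for the full
   ucs4_to_utf8 routine.

```c
/* Hypothetical helper: encode a UCS code in the range 0x80-0x07FF
   as a 2 byte UTF-8 sequence. Returns 2 on success, or 0 if the code
   is outside the 2 byte range (nothing is written in that case). */
int encode_2byte(unsigned long ucs4, unsigned char out[2])
{
    if (ucs4 < 0x80 || ucs4 > 0x07FF)
        return 0;
    out[0] = (unsigned char) (0xC0 + (ucs4 / 0x40)); /* upper 5 bits */
    out[1] = (unsigned char) (0x80 + (ucs4 % 0x40)); /* lower 6 bits */
    return 2;
}
```

   For 0x000005D5 the quotient is 0x17 and the remainder is 0x15, so
   the helper produces the bytes 0xD7 0x95, matching steps 2 and 3.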

   The UTF-8 encoding is transformed back to UCS-4 by using the
   utf8_to_ucs4 routine described earlier by:

   1. Because the first byte of the sequence, when the '&' operator with
      a value of 0xE0 is applied, will produce 0xC0 (0xD7 & 0xE0 = 0xC0)
      the UTF-8 is a 2 byte sequence.
   2. The four byte UCS-4 character code is produced by (((0xD7 - 0xC0)
      * 0x40) + (0x95 - 0x80)) = 0x000005D5.

   Finally, the UCS-4 character code is converted to the ISO/IEC 8859-8
   character code (using the mapping table which matches ISO/IEC 8859-8
   to UCS-4) to produce the original 0xE5 code for the Hebrew letter
   "VAV".

B.2.4 Vendor Codepage Example

   This example demonstrates the mapping of a vendor codepage to UTF-8
   and back to a vendor codepage. Mapping between vendor codepages can
   be done in a very similar manner as described above. For instance
   both the PC and Mac codepages reflect the character set from the Thai
   standard TIS 620-2533. The character code on both platforms for the
   Thai letter "SO SO" is 0xAB. This character can then be mapped into
   the UCS-4 by way of a conversion/mapping file to produce the UCS-4
   code of 0x0E0B.

   The UCS-4 character code is transformed into UTF-8 using the
   ucs4_to_utf8 routine described earlier by:

   1. Because the UCS-4 character is between 0x0800 and 0xFFFF it will
      map to a 3 byte UTF-8 sequence.
   2. The first byte is defined by (0xE0 + (0x00000E0B / 0x1000)) =
      0xE0.
   3. The second byte is defined by (0x80 + ((0x00000E0B / 0x40) %
      0x40)) = 0xB8.
   4. The third byte is defined by (0x80 + (0x00000E0B % 0x40)) = 0x8B.

   The UTF-8 encoding is transformed back to UCS-4 by using the
   utf8_to_ucs4 routine described earlier by:

   1.
      Because the first byte of the sequence, when the '&' operator with
      a value of 0xF0 is applied, will produce 0xE0 (0xE0 & 0xF0 = 0xE0)
      the UTF-8 is a 3 byte sequence.
   2. The four byte UCS-4 character code is produced by (((0xE0 - 0xE0)
      * 0x1000) + ((0xB8 - 0x80) * 0x40) + (0x8B - 0x80)) = 0x00000E0B.

   Finally, the UCS-4 character code is converted to either the PC or
   Mac codepage character code (using the mapping table which matches
   codepage to UCS-4) to produce the original 0xAB code for the Thai
   letter "SO SO".

B.3 Pseudo Code for a High-Quality Translating Server

   if utf8_valid(fn)
     {
     attempt to convert fn to the local charset, producing localfn
     if (conversion fails temporarily) return error
     if (conversion succeeds)
     {
       attempt to open localfn
       if (open fails temporarily) return error
       if (open succeeds) return success
     }
     }
   attempt to open fn
   if (open fails temporarily) return error
   if (open succeeds) return success
   return permanent error

Full Copyright Statement

   Copyright (C) The Internet Society (1999).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such as
   by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the procedures
   for copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.