author | Thomas Voss <mail@thomasvoss.com> | 2024-11-27 20:54:24 +0100 |
---|---|---|
committer | Thomas Voss <mail@thomasvoss.com> | 2024-11-27 20:54:24 +0100 |
commit | 4bfd864f10b68b71482b35c818559068ef8d5797 (patch) | |
tree | e3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc5987.txt | |
parent | ea76e11061bda059ae9f9ad130a9895cc85607db (diff) |
doc: Add RFC documents
Diffstat (limited to 'doc/rfc/rfc5987.txt')
-rw-r--r-- | doc/rfc/rfc5987.txt | 563 |
1 files changed, 563 insertions, 0 deletions
diff --git a/doc/rfc/rfc5987.txt b/doc/rfc/rfc5987.txt
new file mode 100644
index 0000000..37cac39
--- /dev/null
+++ b/doc/rfc/rfc5987.txt
@@ -0,0 +1,563 @@
+
+
+
+
+
+
+Internet Engineering Task Force (IETF)                        J. Reschke
+Request for Comments: 5987                                    greenbytes
+Category: Standards Track                                    August 2010
+ISSN: 2070-1721
+
+
+                Character Set and Language Encoding for
+       Hypertext Transfer Protocol (HTTP) Header Field Parameters
+
+Abstract
+
+   By default, message header field parameters in Hypertext Transfer
+   Protocol (HTTP) messages cannot carry characters outside the ISO-
+   8859-1 character set.  RFC 2231 defines an encoding mechanism for use
+   in Multipurpose Internet Mail Extensions (MIME) headers.  This
+   document specifies an encoding suitable for use in HTTP header fields
+   that is compatible with a profile of the encoding defined in RFC
+   2231.
+
+Status of This Memo
+
+   This is an Internet Standards Track document.
+
+   This document is a product of the Internet Engineering Task Force
+   (IETF).  It represents the consensus of the IETF community.  It has
+   received public review and has been approved for publication by the
+   Internet Engineering Steering Group (IESG).  Further information on
+   Internet Standards is available in Section 2 of RFC 5741.
+
+   Information about the current status of this document, any errata,
+   and how to provide feedback on it may be obtained at
+   http://www.rfc-editor.org/info/rfc5987.
+
+Copyright Notice
+
+   Copyright (c) 2010 IETF Trust and the persons identified as the
+   document authors.  All rights reserved.
+
+   This document is subject to BCP 78 and the IETF Trust's Legal
+   Provisions Relating to IETF Documents
+   (http://trustee.ietf.org/license-info) in effect on the date of
+   publication of this document.  Please review these documents
+   carefully, as they describe your rights and restrictions with respect
+   to this document.  Code Components extracted from this document must
+   include Simplified BSD License text as described in Section 4.e of
+   the Trust Legal Provisions and are provided without warranty as
+   described in the Simplified BSD License.
+
+
+
+
+Reschke                      Standards Track                    [Page 1]
+
+RFC 5987          Charset/Language Encoding in HTTP          August 2010
+
+
+Table of Contents
+
+   1. Introduction ....................................................2
+   2. Notational Conventions ..........................................2
+   3. Comparison to RFC 2231 and Definition of the Encoding ...........3
+      3.1. Parameter Continuations ....................................3
+      3.2. Parameter Value Character Set and Language Information .....3
+           3.2.1. Definition ..........................................3
+           3.2.2. Examples ............................................6
+      3.3. Language Specification in Encoded Words ....................6
+   4. Guidelines for Usage in HTTP Header Field Definitions ...........7
+      4.1. When to Use the Extension ..................................7
+      4.2. Error Handling .............................................7
+   5. Security Considerations .........................................8
+   6. Acknowledgements ................................................8
+   7. References ......................................................8
+      7.1. Normative References .......................................8
+      7.2. Informative References .....................................9
+
+1.  Introduction
+
+   By default, message header field parameters in HTTP ([RFC2616])
+   messages cannot carry characters outside the ISO-8859-1 character set
+   ([ISO-8859-1]).  RFC 2231 ([RFC2231]) defines an encoding mechanism
+   for use in MIME headers.  This document specifies an encoding
+   suitable for use in HTTP header fields that is compatible with a
+   profile of the encoding defined in RFC 2231.
+
+      Note: in the remainder of this document, RFC 2231 is only
+      referenced for the purpose of explaining the choice of features
+      that were adopted; they are therefore purely informative.
+
+      Note: this encoding does not apply to message payloads transmitted
+      over HTTP, such as when using the media type "multipart/form-data"
+      ([RFC2388]).
+
+2.  Notational Conventions
+
+   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
+   document are to be interpreted as described in [RFC2119].
+
+   This specification uses the ABNF (Augmented Backus-Naur Form)
+   notation defined in [RFC5234].  The following core rules are included
+   by reference, as defined in [RFC5234], Appendix B.1: ALPHA (letters),
+   DIGIT (decimal 0-9), HEXDIG (hexadecimal 0-9/A-F/a-f), and LWSP
+   (linear whitespace).
+
+
+
+
+Reschke                      Standards Track                    [Page 2]
+
+RFC 5987          Charset/Language Encoding in HTTP          August 2010
+
+
+   Note that this specification uses the term "character set" for
+   consistency with other IETF specifications such as RFC 2277 (see
+   [RFC2277], Section 3).  A more accurate term would be "character
+   encoding" (a mapping of code points to octet sequences).
+
+3.  Comparison to RFC 2231 and Definition of the Encoding
+
+   RFC 2231 defines several extensions to MIME.  The sections below
+   discuss if and how they apply to HTTP header fields.
+
+   In short:
+
+   o  Parameter Continuations aren't needed (Section 3.1),
+
+   o  Character Set and Language Information are useful, therefore a
+      simple subset is specified (Section 3.2), and
+
+   o  Language Specifications in Encoded Words aren't needed
+      (Section 3.3).
+
+3.1.  Parameter Continuations
+
+   Section 3 of [RFC2231] defines a mechanism that deals with the length
+   limitations that apply to MIME headers.  These limitations do not
+   apply to HTTP ([RFC2616], Section 19.4.7).
+
+   Thus, parameter continuations are not part of the encoding defined by
+   this specification.
+
+3.2.  Parameter Value Character Set and Language Information
+
+   Section 4 of [RFC2231] specifies how to embed language information
+   into parameter values, and also how to encode non-ASCII characters,
+   dealing with restrictions both in MIME and HTTP header parameters.
+
+   However, RFC 2231 does not specify a mandatory-to-implement character
+   set, making it hard for senders to decide which character set to use.
+   Thus, recipients implementing this specification MUST support the
+   character sets "ISO-8859-1" [ISO-8859-1] and "UTF-8" [RFC3629].
+
+   Furthermore, RFC 2231 allows the character set information to be left
+   out.  The encoding defined by this specification does not allow that.
+
+3.2.1.  Definition
+
+   The syntax for parameters is defined in Section 3.6 of [RFC2616]
+   (with RFC 2616 implied LWS translated to RFC 5234 LWSP):
+
+
+
+
+Reschke                      Standards Track                    [Page 3]
+
+RFC 5987          Charset/Language Encoding in HTTP          August 2010
+
+
+     parameter     = attribute LWSP "=" LWSP value
+
+     attribute     = token
+     value         = token / quoted-string
+
+     quoted-string = <quoted-string, defined in [RFC2616], Section 2.2>
+     token         = <token, defined in [RFC2616], Section 2.2>
+
+   In order to include character set and language information, this
+   specification modifies the RFC 2616 grammar to be:
+
+     parameter     = reg-parameter / ext-parameter
+
+     reg-parameter = parmname LWSP "=" LWSP value
+
+     ext-parameter = parmname "*" LWSP "=" LWSP ext-value
+
+     parmname      = 1*attr-char
+
+     ext-value     = charset  "'" [ language ] "'" value-chars
+                   ; like RFC 2231's <extended-initial-value>
+                   ; (see [RFC2231], Section 7)
+
+     charset       = "UTF-8" / "ISO-8859-1" / mime-charset
+
+     mime-charset  = 1*mime-charsetc
+     mime-charsetc = ALPHA / DIGIT
+                   / "!" / "#" / "$" / "%" / "&"
+                   / "+" / "-" / "^" / "_" / "`"
+                   / "{" / "}" / "~"
+                   ; as <mime-charset> in Section 2.3 of [RFC2978]
+                   ; except that the single quote is not included
+                   ; SHOULD be registered in the IANA charset registry
+
+     language      = <Language-Tag, defined in [RFC5646], Section 2.1>
+
+     value-chars   = *( pct-encoded / attr-char )
+
+     pct-encoded   = "%" HEXDIG HEXDIG
+                   ; see [RFC3986], Section 2.1
+
+     attr-char     = ALPHA / DIGIT
+                   / "!" / "#" / "$" / "&" / "+" / "-" / "."
+                   / "^" / "_" / "`" / "|" / "~"
+                   ; token except ( "*" / "'" / "%" )
+
+
+
+
+
+
+Reschke                      Standards Track                    [Page 4]
+
+RFC 5987          Charset/Language Encoding in HTTP          August 2010
+
+
+   Thus, a parameter is either a regular parameter (reg-parameter), as
+   previously defined in Section 3.6 of [RFC2616], or an extended
+   parameter (ext-parameter).
+
+   Extended parameters are those where the left-hand side of the
+   assignment ends with an asterisk character.
+
+   The value part of an extended parameter (ext-value) is a token that
+   consists of three parts: the REQUIRED character set name (charset),
+   the OPTIONAL language information (language), and a character
+   sequence representing the actual value (value-chars), separated by
+   single quote characters.  Note that both character set names and
+   language tags are restricted to the US-ASCII character set, and are
+   matched case-insensitively (see [RFC2978], Section 2.3 and [RFC5646],
+   Section 2.1.1).
+
+   Inside the value part, characters not contained in attr-char are
+   encoded into an octet sequence using the specified character set.
+   That octet sequence is then percent-encoded as specified in Section
+   2.1 of [RFC3986].
+
+   Producers MUST use either the "UTF-8" ([RFC3629]) or the "ISO-8859-1"
+   ([ISO-8859-1]) character set.  Extension character sets (mime-
+   charset) are reserved for future use.
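Editorial illustration (not part of the committed file): the producer-side steps described above, encoding to octets in the chosen character set and then percent-encoding every octet outside attr-char, can be sketched in Python. This is a minimal sketch under stated assumptions; the name `encode_ext_value` and its parameters are hypothetical and do not come from the RFC or from this repository.

```python
import string

# attr-char from the ABNF above: ALPHA / DIGIT plus a fixed set of symbols.
ATTR_CHARS = frozenset(string.ascii_letters + string.digits + "!#$&+-.^_`|~")


def encode_ext_value(value: str, charset: str = "UTF-8", language: str = "") -> str:
    """Build an ext-value: charset "'" [ language ] "'" value-chars.

    Hypothetical helper for illustration only.
    """
    if charset.upper() not in ("UTF-8", "ISO-8859-1"):
        raise ValueError("producers must use UTF-8 or ISO-8859-1")
    out = []
    # Encode to octets first, then percent-encode every octet that does not
    # correspond to an attr-char. Raises UnicodeEncodeError if the value
    # cannot be represented in the chosen character set.
    for octet in value.encode(charset):
        ch = chr(octet)
        out.append(ch if ch in ATTR_CHARS else "%{:02X}".format(octet))
    return "{}'{}'{}".format(charset, language, "".join(out))
```

The sketch emits uppercase hex digits; HEXDIG permits either case, so that choice is arbitrary.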
+
+      Note: recipients should be prepared to handle encoding errors,
+      such as malformed or incomplete percent escape sequences, or non-
+      decodable octet sequences, in a robust manner.  This specification
+      does not mandate any specific behavior, for instance, the
+      following strategies are all acceptable:
+
+      *  ignoring the parameter,
+
+      *  stripping a non-decodable octet sequence,
+
+      *  substituting a non-decodable octet sequence by a replacement
+         character, such as the Unicode character U+FFFD (Replacement
+         Character).
+
+      Note: the RFC 2616 token production ([RFC2616], Section 2.2)
+      differs from the production used in RFC 2231 (imported from
+      Section 5.1 of [RFC2045]) in that curly braces ("{" and "}") are
+      excluded.  Thus, these two characters are excluded from the attr-
+      char production as well.
+
+
+
+
+
+
+
+Reschke                      Standards Track                    [Page 5]
+
+RFC 5987          Charset/Language Encoding in HTTP          August 2010
+
+
+      Note: the <mime-charset> ABNF defined here differs from the one in
+      Section 2.3 of [RFC2978] in that it does not allow the single
+      quote character (see also RFC Errata ID 1912 [Err1912]).  In
+      practice, no character set names using that character have been
+      registered at the time of this writing.
+
+3.2.2.  Examples
+
+   Non-extended notation, using "token":
+
+     foo: bar; title=Economy
+
+   Non-extended notation, using "quoted-string":
+
+     foo: bar; title="US-$ rates"
+
+   Extended notation, using the Unicode character U+00A3 (POUND SIGN):
+
+     foo: bar; title*=iso-8859-1'en'%A3%20rates
+
+   Note: the Unicode pound sign character U+00A3 was encoded into the
+   single octet A3 using the ISO-8859-1 character encoding, then
+   percent-encoded.  Also, note that the space character was encoded as
+   %20, as it is not contained in attr-char.
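Editorial illustration (not part of the committed file): the recipient side of the example above can be sketched in Python. This is a sketch only; `decode_ext_value` is a hypothetical name, and replacing non-decodable octets with U+FFFD is just one of the strategies the specification lists as acceptable. Malformed percent escapes raise `ValueError` here so that a caller can choose to ignore the parameter.

```python
import re

# value-chars = *( pct-encoded / attr-char ), per the ABNF in Section 3.2.1.
_VALUE_CHARS = re.compile(r"(?:%[0-9A-Fa-f]{2}|[A-Za-z0-9!#$&+\-.^_`|~])*")


def decode_ext_value(ext_value: str):
    """Split an ext-value into (charset, language, text).

    Hypothetical helper for illustration only.
    """
    # Fewer than two single quotes makes the unpacking raise ValueError.
    charset, language, value_chars = ext_value.split("'", 2)
    # Character set names are matched case-insensitively.
    if charset.lower() not in ("utf-8", "iso-8859-1"):
        raise ValueError("unsupported charset: " + charset)
    if not _VALUE_CHARS.fullmatch(value_chars):
        raise ValueError("malformed value-chars")
    octets = bytearray()
    i = 0
    while i < len(value_chars):
        if value_chars[i] == "%":
            # Two hex digits follow; the regex above guarantees that.
            octets.append(int(value_chars[i + 1:i + 3], 16))
            i += 3
        else:
            octets.append(ord(value_chars[i]))
            i += 1
    # errors="replace" substitutes U+FFFD for non-decodable octet sequences.
    return charset, language, octets.decode(charset, errors="replace")
```

Applied to `iso-8859-1'en'%A3%20rates`, the sketch recovers the charset, the language tag `en`, and the text with the pound sign and space restored.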
+
+   Extended notation, using the Unicode characters U+00A3 (POUND SIGN)
+   and U+20AC (EURO SIGN):
+
+     foo: bar; title*=UTF-8''%c2%a3%20and%20%e2%82%ac%20rates
+
+   Note: the Unicode pound sign character U+00A3 was encoded into the
+   octet sequence C2 A3 using the UTF-8 character encoding, then
+   percent-encoded.  Likewise, the Unicode euro sign character U+20AC
+   was encoded into the octet sequence E2 82 AC, then percent-encoded.
+   Also note that HEXDIG allows both lowercase and uppercase characters,
+   so recipients must understand both, and that the language information
+   is optional, while the character set is not.
+
+3.3.  Language Specification in Encoded Words
+
+   Section 5 of [RFC2231] extends the encoding defined in [RFC2047] to
+   also support language specification in encoded words.  Although the
+   HTTP/1.1 specification does refer to RFC 2047 ([RFC2616], Section
+   2.2), it's not clear to which header field exactly it applies, and
+   whether it is implemented in practice (see
+   <http://tools.ietf.org/wg/httpbis/trac/ticket/111> for details).
+
+   Thus, this specification does not include this feature.
+
+
+
+Reschke                      Standards Track                    [Page 6]
+
+RFC 5987          Charset/Language Encoding in HTTP          August 2010
+
+
+4.  Guidelines for Usage in HTTP Header Field Definitions
+
+   Specifications of HTTP header fields that use the extensions defined
+   in Section 3.2 ought to clearly state that.  A simple way to achieve
+   this is to normatively reference this specification, and to include
+   the ext-value production into the ABNF for that header field.
+
+   For instance:
+
+     foo-header  = "foo" LWSP ":" LWSP token ";" LWSP title-param
+     title-param = "title" LWSP "=" LWSP value
+                 / "title*" LWSP "=" LWSP ext-value
+     ext-value   = <see RFC 5987, Section 3.2>
+
+      Note: The Parameter Value Continuation feature defined in Section
+      3 of [RFC2231] makes it impossible to have multiple instances of
+      extended parameters with identical parmname components, as the
+      processing of continuations would become ambiguous.  Thus,
+      specifications using this extension are advised to disallow this
+      case for compatibility with RFC 2231.
+
+4.1.  When to Use the Extension
+
+   Section 4.2 of [RFC2277] requires that protocol elements containing
+   human-readable text are able to carry language information.  Thus,
+   the ext-value production ought to be always used when the parameter
+   value is of textual nature and its language is known.
+
+   Furthermore, the extension ought to also be used whenever the
+   parameter value needs to carry characters not present in the US-ASCII
+   ([USASCII]) character set (note that it would be unacceptable to
+   define a new parameter that would be restricted to a subset of the
+   Unicode character set).
+
+4.2.  Error Handling
+
+   Header field specifications need to define whether multiple instances
+   of parameters with identical parmname components are allowed, and how
+   they should be processed.  This specification suggests that a
+   parameter using the extended syntax takes precedence.  This would
+   allow producers to use both formats without breaking recipients that
+   do not understand the extended syntax yet.
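Editorial illustration (not part of the committed file): the suggested precedence rule can be sketched in Python. The helper `pick_parameter` and its input shape are assumptions made for illustration: a list of already-split (parmname, value) pairs in header order, where values of names ending in "*" are assumed to have been decoded from their ext-value form beforehand. Keeping the first occurrence of each form is also an assumption; the specification leaves duplicate handling to each header field definition.

```python
def pick_parameter(params, name):
    """Return the value for `name`, preferring the extended form `name*`.

    Hypothetical helper for illustration only. Returns None when neither
    form is present.
    """
    regular = None
    extended = None
    for parmname, value in params:
        if parmname == name + "*" and extended is None:
            extended = value
        elif parmname == name and regular is None:
            regular = value
    # The extended syntax takes precedence when both forms are present.
    return extended if extended is not None else regular
```

A recipient that does not understand the extended syntax would simply never look for `name*` and would fall back to the regular parameter.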
+
+   Example:
+
+     foo: bar; title="EURO exchange rates";
+               title*=utf-8''%e2%82%ac%20exchange%20rates
+
+
+
+
+Reschke                      Standards Track                    [Page 7]
+
+RFC 5987          Charset/Language Encoding in HTTP          August 2010
+
+
+   In this case, the sender provides an ASCII version of the title for
+   legacy recipients, but also includes an internationalized version for
+   recipients understanding this specification -- the latter obviously
+   ought to prefer the new syntax over the old one.
+
+      Note: at the time of this writing, many implementations failed to
+      ignore the form they do not understand, or prioritize the ASCII
+      form although the extended syntax was present.
+
+5.  Security Considerations
+
+   The format described in this document makes it possible to transport
+   non-ASCII characters, and thus enables character "spoofing"
+   scenarios, in which a displayed value appears to be something other
+   than it is.
+
+   Furthermore, there are known attack scenarios relating to decoding
+   UTF-8.
+
+   See Section 10 of [RFC3629] for more information on both topics.
+
+   In addition, the extension specified in this document makes it
+   possible to transport multiple language variants for a single
+   parameter, and such use might allow spoofing attacks, where different
+   language versions of the same parameter are not equivalent.  Whether
+   this attack is useful as an attack depends on the parameter
+   specified.
+
+6.  Acknowledgements
+
+   Thanks to Martin Duerst and Frank Ellermann for help figuring out
+   ABNF details, to Graham Klyne and Alexey Melnikov for general review,
+   to Chris Newman for pointing out an RFC 2231 incompatibility, and to
+   Benjamin Carlyle and Roar Lauritzsen for implementer's feedback.
+
+7.  References
+
+7.1.  Normative References
+
+   [ISO-8859-1]  International Organization for Standardization,
+                 "Information technology -- 8-bit single-byte coded
+                 graphic character sets -- Part 1: Latin alphabet No.
+                 1", ISO/IEC 8859-1:1998, 1998.
+
+   [RFC2119]     Bradner, S., "Key words for use in RFCs to Indicate
+                 Requirement Levels", BCP 14, RFC 2119, March 1997.
+
+
+
+
+
+Reschke                      Standards Track                    [Page 8]
+
+RFC 5987          Charset/Language Encoding in HTTP          August 2010
+
+
+   [RFC2616]     Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
+                 Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
+                 Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.
+
+   [RFC2978]     Freed, N. and J. Postel, "IANA Charset Registration
+                 Procedures", BCP 19, RFC 2978, October 2000.
+
+   [RFC3629]     Yergeau, F., "UTF-8, a transformation format of ISO
+                 10646", RFC 3629, STD 63, November 2003.
+
+   [RFC3986]     Berners-Lee, T., Fielding, R., and L. Masinter,
+                 "Uniform Resource Identifier (URI): Generic Syntax",
+                 RFC 3986, STD 66, January 2005.
+
+   [RFC5234]     Crocker, D., Ed. and P. Overell, "Augmented BNF for
+                 Syntax Specifications: ABNF", STD 68, RFC 5234,
+                 January 2008.
+
+   [RFC5646]     Phillips, A., Ed. and M. Davis, Ed., "Tags for
+                 Identifying Languages", BCP 47, RFC 5646,
+                 September 2009.
+
+   [USASCII]     American National Standards Institute, "Coded Character
+                 Set -- 7-bit American Standard Code for Information
+                 Interchange", ANSI X3.4, 1986.
+
+7.2.  Informative References
+
+   [Err1912]     RFC Errata, Errata ID 1912, RFC 2978,
+                 <http://www.rfc-editor.org>.
+
+   [RFC2045]     Freed, N. and N. Borenstein, "Multipurpose Internet
+                 Mail Extensions (MIME) Part One: Format of Internet
+                 Message Bodies", RFC 2045, November 1996.
+
+   [RFC2047]     Moore, K., "MIME (Multipurpose Internet Mail
+                 Extensions) Part Three: Message Header Extensions for
+                 Non-ASCII Text", RFC 2047, November 1996.
+
+   [RFC2231]     Freed, N. and K. Moore, "MIME Parameter Value and
+                 Encoded Word Extensions: Character Sets, Languages, and
+                 Continuations", RFC 2231, November 1997.
+
+   [RFC2277]     Alvestrand, H., "IETF Policy on Character Sets and
+                 Languages", BCP 18, RFC 2277, January 1998.
+
+   [RFC2388]     Masinter, L., "Returning Values from Forms: multipart/
+                 form-data", RFC 2388, August 1998.
+
+
+
+Reschke                      Standards Track                    [Page 9]
+
+RFC 5987          Charset/Language Encoding in HTTP          August 2010
+
+
+Author's Address
+
+   Julian F. Reschke
+   greenbytes GmbH
+   Hafenweg 16
+   Muenster, NW  48155
+   Germany
+
+   EMail: julian.reschke@greenbytes.de
+   URI:   http://greenbytes.de/tech/webdav/
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Reschke                      Standards Track                   [Page 10]
+