author | Thomas Voss <mail@thomasvoss.com> | 2024-11-27 20:54:24 +0100 |
---|---|---|
committer | Thomas Voss <mail@thomasvoss.com> | 2024-11-27 20:54:24 +0100 |
commit | 4bfd864f10b68b71482b35c818559068ef8d5797 | |
tree | e3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc2396.txt | |
parent | ea76e11061bda059ae9f9ad130a9895cc85607db | |
doc: Add RFC documents
Diffstat (limited to 'doc/rfc/rfc2396.txt')
-rw-r--r-- | doc/rfc/rfc2396.txt | 2243 |
1 files changed, 2243 insertions, 0 deletions
diff --git a/doc/rfc/rfc2396.txt b/doc/rfc/rfc2396.txt new file mode 100644 index 0000000..5bd5211 --- /dev/null +++ b/doc/rfc/rfc2396.txt @@ -0,0 +1,2243 @@ + + + + + + +Network Working Group T. Berners-Lee +Request for Comments: 2396 MIT/LCS +Updates: 1808, 1738 R. Fielding +Category: Standards Track U.C. Irvine + L. Masinter + Xerox Corporation + August 1998 + + + Uniform Resource Identifiers (URI): Generic Syntax + +Status of this Memo + + This document specifies an Internet standards track protocol for the + Internet community, and requests discussion and suggestions for + improvements. Please refer to the current edition of the "Internet + Official Protocol Standards" (STD 1) for the standardization state + and status of this protocol. Distribution of this memo is unlimited. + +Copyright Notice + + Copyright (C) The Internet Society (1998). All Rights Reserved. + +IESG Note + + This paper describes a "superset" of operations that can be applied + to URI. It consists of both a grammar and a description of basic + functionality for URI. To understand what is a valid URI, both the + grammar and the associated description have to be studied. Some of + the functionality described is not applicable to all URI schemes, and + some operations are only possible when certain media types are + retrieved using the URI, regardless of the scheme used. + +Abstract + + A Uniform Resource Identifier (URI) is a compact string of characters + for identifying an abstract or physical resource. This document + defines the generic syntax of URI, including both absolute and + relative forms, and guidelines for their use; it revises and replaces + the generic definitions in RFC 1738 and RFC 1808. + + This document defines a grammar that is a superset of all valid URI, + such that an implementation can parse the common components of a URI + reference without knowing the scheme-specific requirements of every + possible identifier type. 
This document does not define a generative + grammar for URI; that task will be performed by the individual + specifications of each URI scheme. + + + + +Berners-Lee, et. al. Standards Track [Page 1] + +RFC 2396 URI Generic Syntax August 1998 + + +1. Introduction + + Uniform Resource Identifiers (URI) provide a simple and extensible + means for identifying a resource. This specification of URI syntax + and semantics is derived from concepts introduced by the World Wide + Web global information initiative, whose use of such objects dates + from 1990 and is described in "Universal Resource Identifiers in WWW" + [RFC1630]. The specification of URI is designed to meet the + recommendations laid out in "Functional Recommendations for Internet + Resource Locators" [RFC1736] and "Functional Requirements for Uniform + Resource Names" [RFC1737]. + + This document updates and merges "Uniform Resource Locators" + [RFC1738] and "Relative Uniform Resource Locators" [RFC1808] in order + to define a single, generic syntax for all URI. It excludes those + portions of RFC 1738 that defined the specific syntax of individual + URL schemes; those portions will be updated as separate documents, as + will the process for registration of new URI schemes. This document + does not discuss the issues and recommendation for dealing with + characters outside of the US-ASCII character set [ASCII]; those + recommendations are discussed in a separate document. + + All significant changes from the prior RFCs are noted in Appendix G. 
+ +1.1 Overview of URI + + URI are characterized by the following definitions: + + Uniform + Uniformity provides several benefits: it allows different types + of resource identifiers to be used in the same context, even + when the mechanisms used to access those resources may differ; + it allows uniform semantic interpretation of common syntactic + conventions across different types of resource identifiers; it + allows introduction of new types of resource identifiers + without interfering with the way that existing identifiers are + used; and, it allows the identifiers to be reused in many + different contexts, thus permitting new applications or + protocols to leverage a pre-existing, large, and widely-used + set of resource identifiers. + + Resource + A resource can be anything that has identity. Familiar + examples include an electronic document, an image, a service + (e.g., "today's weather report for Los Angeles"), and a + collection of other resources. Not all resources are network + "retrievable"; e.g., human beings, corporations, and bound + books in a library can also be considered resources. + + + +Berners-Lee, et. al. Standards Track [Page 2] + +RFC 2396 URI Generic Syntax August 1998 + + + The resource is the conceptual mapping to an entity or set of + entities, not necessarily the entity which corresponds to that + mapping at any particular instance in time. Thus, a resource + can remain constant even when its content---the entities to + which it currently corresponds---changes over time, provided + that the conceptual mapping is not changed in the process. + + Identifier + An identifier is an object that can act as a reference to + something that has identity. In the case of URI, the object is + a sequence of characters with a restricted syntax. + + Having identified a resource, a system may perform a variety of + operations on the resource, as might be characterized by such words + as `access', `update', `replace', or `find attributes'. + +1.2. 
URI, URL, and URN + + A URI can be further classified as a locator, a name, or both. The + term "Uniform Resource Locator" (URL) refers to the subset of URI + that identify resources via a representation of their primary access + mechanism (e.g., their network "location"), rather than identifying + the resource by name or by some other attribute(s) of that resource. + The term "Uniform Resource Name" (URN) refers to the subset of URI + that are required to remain globally unique and persistent even when + the resource ceases to exist or becomes unavailable. + + The URI scheme (Section 3.1) defines the namespace of the URI, and + thus may further restrict the syntax and semantics of identifiers + using that scheme. This specification defines those elements of the + URI syntax that are either required of all URI schemes or are common + to many URI schemes. It thus defines the syntax and semantics that + are needed to implement a scheme-independent parsing mechanism for + URI references, such that the scheme-dependent handling of a URI can + be postponed until the scheme-dependent semantics are needed. We use + the term URL below when describing syntax or semantics that only + apply to locators. + + Although many URL schemes are named after protocols, this does not + imply that the only way to access the URL's resource is via the named + protocol. Gateways, proxies, caches, and name resolution services + might be used to access some resources, independent of the protocol + of their origin, and the resolution of some URL may require the use + of more than one protocol (e.g., both DNS and HTTP are typically used + to access an "http" URL's resource when it can't be found in a local + cache). + + + + + +Berners-Lee, et. al. Standards Track [Page 3] + +RFC 2396 URI Generic Syntax August 1998 + + + A URN differs from a URL in that it's primary purpose is persistent + labeling of a resource with an identifier. 
That identifier is drawn + from one of a set of defined namespaces, each of which has its own + set name structure and assignment procedures. The "urn" scheme has + been reserved to establish the requirements for a standardized URN + namespace, as defined in "URN Syntax" [RFC2141] and its related + specifications. + + Most of the examples in this specification demonstrate URL, since + they allow the most varied use of the syntax and often have a + hierarchical namespace. A parser of the URI syntax is capable of + parsing both URL and URN references as a generic URI; once the scheme + is determined, the scheme-specific parsing can be performed on the + generic URI components. In other words, the URI syntax is a superset + of the syntax of all URI schemes. + +1.3. Example URI + + The following examples illustrate URI that are in common use. + + ftp://ftp.is.co.za/rfc/rfc1808.txt + -- ftp scheme for File Transfer Protocol services + + gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles + -- gopher scheme for Gopher and Gopher+ Protocol services + + http://www.math.uio.no/faq/compression-faq/part1.html + -- http scheme for Hypertext Transfer Protocol services + + mailto:mduerst@ifi.unizh.ch + -- mailto scheme for electronic mail addresses + + news:comp.infosystems.www.servers.unix + -- news scheme for USENET news groups and articles + + telnet://melvyl.ucop.edu/ + -- telnet scheme for interactive services via the TELNET Protocol + +1.4. Hierarchical URI and Relative Forms + + An absolute identifier refers to a resource independent of the + context in which the identifier is used. In contrast, a relative + identifier refers to a resource by describing the difference within a + hierarchical namespace between the current context and an absolute + identifier of the resource. + + + + + + +Berners-Lee, et. al. 
Standards Track [Page 4] + +RFC 2396 URI Generic Syntax August 1998 + + + Some URI schemes support a hierarchical naming system, where the + hierarchy of the name is denoted by a "/" delimiter separating the + components in the scheme. This document defines a scheme-independent + `relative' form of URI reference that can be used in conjunction with + a `base' URI (of a hierarchical scheme) to produce another URI. The + syntax of hierarchical URI is described in Section 3; the relative + URI calculation is described in Section 5. + +1.5. URI Transcribability + + The URI syntax was designed with global transcribability as one of + its main concerns. A URI is a sequence of characters from a very + limited set, i.e. the letters of the basic Latin alphabet, digits, + and a few special characters. A URI may be represented in a variety + of ways: e.g., ink on paper, pixels on a screen, or a sequence of + octets in a coded character set. The interpretation of a URI depends + only on the characters used and not how those characters are + represented in a network protocol. + + The goal of transcribability can be described by a simple scenario. + Imagine two colleagues, Sam and Kim, sitting in a pub at an + international conference and exchanging research ideas. Sam asks Kim + for a location to get more information, so Kim writes the URI for the + research site on a napkin. Upon returning home, Sam takes out the + napkin and types the URI into a computer, which then retrieves the + information to which Kim referred. + + There are several design concerns revealed by the scenario: + + o A URI is a sequence of characters, which is not always + represented as a sequence of octets. + + o A URI may be transcribed from a non-network source, and thus + should consist of characters that are most likely to be able to + be typed into a computer, within the constraints imposed by + keyboards (and related input devices) across languages and + locales. 
+ + o A URI often needs to be remembered by people, and it is easier + for people to remember a URI when it consists of meaningful + components. + + These design concerns are not always in alignment. For example, it + is often the case that the most meaningful name for a URI component + would require characters that cannot be typed into some systems. The + ability to transcribe the resource identifier from one medium to + another was considered more important than having its URI consist of + the most meaningful of components. In local and regional contexts + + + +Berners-Lee, et. al. Standards Track [Page 5] + +RFC 2396 URI Generic Syntax August 1998 + + + and with improving technology, users might benefit from being able to + use a wider range of characters; such use is not defined in this + document. + +1.6. Syntax Notation and Common Elements + + This document uses two conventions to describe and define the syntax + for URI. The first, called the layout form, is a general description + of the order of components and component separators, as in + + <first>/<second>;<third>?<fourth> + + The component names are enclosed in angle-brackets and any characters + outside angle-brackets are literal separators. Whitespace should be + ignored. These descriptions are used informally and do not define + the syntax requirements. + + The second convention is a BNF-like grammar, used to define the + formal URI syntax. The grammar is that of [RFC822], except that "|" + is used to designate alternatives. Briefly, rules are separated from + definitions by an equal "=", indentation is used to continue a rule + definition over more than one line, literals are quoted with "", + parentheses "(" and ")" are used to group elements, optional elements + are enclosed in "[" and "]" brackets, and elements may be preceded + with <n>* to designate n or more repetitions of the following + element; n defaults to 0. 
+ + Unlike many specifications that use a BNF-like grammar to define the + bytes (octets) allowed by a protocol, the URI grammar is defined in + terms of characters. Each literal in the grammar corresponds to the + character it represents, rather than to the octet encoding of that + character in any particular coded character set. How a URI is + represented in terms of bits and bytes on the wire is dependent upon + the character encoding of the protocol used to transport it, or the + charset of the document which contains it. + + The following definitions are common to many elements: + + alpha = lowalpha | upalpha + + lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | + "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | + "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z" + + upalpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | + "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | + "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" + + + + +Berners-Lee, et. al. Standards Track [Page 6] + +RFC 2396 URI Generic Syntax August 1998 + + + digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | + "8" | "9" + + alphanum = alpha | digit + + The complete URI syntax is collected in Appendix A. + +2. URI Characters and Escape Sequences + + URI consist of a restricted set of characters, primarily chosen to + aid transcribability and usability both in computer systems and in + non-computer communications. Characters used conventionally as + delimiters around URI were excluded. The restricted set of + characters consists of digits, letters, and a few graphic symbols + were chosen from those common to most of the character encodings and + input facilities available to Internet users. + + uric = reserved | unreserved | escaped + + Within a URI, characters are either used as delimiters, or to + represent strings of data (octets) within the delimited portions. 
+ Octets are either represented directly by a character (using the US- + ASCII character for that octet [ASCII]) or by an escape encoding. + This representation is elaborated below. + +2.1 URI and non-ASCII characters + + The relationship between URI and characters has been a source of + confusion for characters that are not part of US-ASCII. To describe + the relationship, it is useful to distinguish between a "character" + (as a distinguishable semantic entity) and an "octet" (an 8-bit + byte). There are two mappings, one from URI characters to octets, and + a second from octets to original characters: + + URI character sequence->octet sequence->original character sequence + + A URI is represented as a sequence of characters, not as a sequence + of octets. That is because URI might be "transported" by means that + are not through a computer network, e.g., printed on paper, read over + the radio, etc. + + A URI scheme may define a mapping from URI characters to octets; + whether this is done depends on the scheme. Commonly, within a + delimited component of a URI, a sequence of characters may be used to + represent a sequence of octets. For example, the character "a" + represents the octet 97 (decimal), while the character sequence "%", + "0", "a" represents the octet 10 (decimal). + + + + +Berners-Lee, et. al. Standards Track [Page 7] + +RFC 2396 URI Generic Syntax August 1998 + + + There is a second translation for some resources: the sequence of + octets defined by a component of the URI is subsequently used to + represent a sequence of characters. A 'charset' defines this mapping. + There are many charsets in use in Internet protocols. For example, + UTF-8 [UTF-8] defines a mapping from sequences of octets to sequences + of characters in the repertoire of ISO 10646. 
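The two mappings described in Section 2.1 (URI characters to octets, then octets to original characters) can be sketched as follows. This is an illustration, not part of the RFC text: the function names are hypothetical, and the use of UTF-8 as the charset is an assumption, since the generic syntax does not say which charset a component uses.

```python
HEX_DIGITS = "0123456789ABCDEFabcdef"


def uri_chars_to_octets(component: str) -> bytes:
    """First mapping: URI character sequence -> octet sequence.

    Hypothetical helper. "%" followed by two hex digits yields one octet;
    any other character yields its US-ASCII octet.
    """
    out = bytearray()
    i = 0
    while i < len(component):
        ch = component[i]
        if ch == "%":
            hex_pair = component[i + 1:i + 3]
            if len(hex_pair) != 2 or any(c not in HEX_DIGITS for c in hex_pair):
                raise ValueError(f"malformed escape at offset {i}")
            out.append(int(hex_pair, 16))
            i += 3
        else:
            # Raises UnicodeEncodeError if the character is not US-ASCII.
            out.extend(ch.encode("ascii"))
            i += 1
    return bytes(out)


def octets_to_characters(octets: bytes, charset: str = "utf-8") -> str:
    """Second mapping: octet sequence -> original character sequence.

    The charset is an assumption made by the caller; the generic URI
    syntax gives no way to discover it.
    """
    return octets.decode(charset)


print(uri_chars_to_octets("a")[0])    # 97
print(uri_chars_to_octets("%0a")[0])  # 10
print(octets_to_characters(uri_chars_to_octets("Los%20Angeles")))  # Los Angeles
```

Keeping the two steps in separate functions mirrors the text: the first step is fixed by the escape syntax, while the second depends on a charset that must be supplied from outside the URI.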
+ + In the simplest case, the original character sequence contains only + characters that are defined in US-ASCII, and the two levels of + mapping are simple and easily invertible: each 'original character' + is represented as the octet for the US-ASCII code for it, which is, + in turn, represented as either the US-ASCII character, or else the + "%" escape sequence for that octet. + + For original character sequences that contain non-ASCII characters, + however, the situation is more difficult. Internet protocols that + transmit octet sequences intended to represent character sequences + are expected to provide some way of identifying the charset used, if + there might be more than one [RFC2277]. However, there is currently + no provision within the generic URI syntax to accomplish this + identification. An individual URI scheme may require a single + charset, define a default charset, or provide a way to indicate the + charset used. + + It is expected that a systematic treatment of character encoding + within URI will be developed as a future modification of this + specification. + +2.2. Reserved Characters + + Many URI include components consisting of or delimited by, certain + special characters. These characters are called "reserved", since + their usage within the URI component is limited to their reserved + purpose. If the data for a URI component would conflict with the + reserved purpose, then the conflicting data must be escaped before + forming the URI. + + reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | + "$" | "," + + The "reserved" syntax class above refers to those characters that are + allowed within a URI, but which may not be allowed within a + particular component of the generic URI syntax; they are used as + delimiters of the components described in Section 3. + + + + + + + +Berners-Lee, et. al. Standards Track [Page 8] + +RFC 2396 URI Generic Syntax August 1998 + + + Characters in the "reserved" set are not reserved in all contexts. 
+ The set of characters actually reserved within any given URI + component is defined by that component. In general, a character is + reserved if the semantics of the URI changes if the character is + replaced with its escaped US-ASCII encoding. + +2.3. Unreserved Characters + + Data characters that are allowed in a URI but do not have a reserved + purpose are called unreserved. These include upper and lower case + letters, decimal digits, and a limited set of punctuation marks and + symbols. + + unreserved = alphanum | mark + + mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")" + + Unreserved characters can be escaped without changing the semantics + of the URI, but this should not be done unless the URI is being used + in a context that does not allow the unescaped character to appear. + +2.4. Escape Sequences + + Data must be escaped if it does not have a representation using an + unreserved character; this includes data that does not correspond to + a printable character of the US-ASCII coded character set, or that + corresponds to any US-ASCII character that is disallowed, as + explained below. + +2.4.1. Escaped Encoding + + An escaped octet is encoded as a character triplet, consisting of the + percent character "%" followed by the two hexadecimal digits + representing the octet code. For example, "%20" is the escaped + encoding for the US-ASCII space character. + + escaped = "%" hex hex + hex = digit | "A" | "B" | "C" | "D" | "E" | "F" | + "a" | "b" | "c" | "d" | "e" | "f" + +2.4.2. When to Escape and Unescape + + A URI is always in an "escaped" form, since escaping or unescaping a + completed URI might change its semantics. Normally, the only time + escape encodings can safely be made is when the URI is being created + from its component parts; each component may have its own set of + characters that are reserved, so only the mechanism responsible for + generating or interpreting that component can determine whether or + + + +Berners-Lee, et. al. 
Standards Track [Page 9] + +RFC 2396 URI Generic Syntax August 1998 + + + not escaping a character will change its semantics. Likewise, a URI + must be separated into its components before the escaped characters + within those components can be safely decoded. + + In some cases, data that could be represented by an unreserved + character may appear escaped; for example, some of the unreserved + "mark" characters are automatically escaped by some systems. If the + given URI scheme defines a canonicalization algorithm, then + unreserved characters may be unescaped according to that algorithm. + For example, "%7e" is sometimes used instead of "~" in an http URL + path, but the two are equivalent for an http URL. + + Because the percent "%" character always has the reserved purpose of + being the escape indicator, it must be escaped as "%25" in order to + be used as data within a URI. Implementers should be careful not to + escape or unescape the same string more than once, since unescaping + an already unescaped string might lead to misinterpreting a percent + data character as another escaped character, or vice versa in the + case of escaping an already escaped string. + +2.4.3. Excluded US-ASCII Characters + + Although they are disallowed within the URI syntax, we include here a + description of those US-ASCII characters that have been excluded and + the reasons for their exclusion. + + The control characters in the US-ASCII coded character set are not + used within a URI, both because they are non-printable and because + they are likely to be misinterpreted by some control mechanisms. + + control = <US-ASCII coded characters 00-1F and 7F hexadecimal> + + The space character is excluded because significant spaces may + disappear and insignificant spaces may be introduced when URI are + transcribed or typeset or subjected to the treatment of word- + processing programs. Whitespace is also used to delimit URI in many + contexts. 
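The warning in Section 2.4.2 about escaping or unescaping the same string more than once can be shown with a short sketch. It uses `quote` and `unquote` from Python's standard `urllib.parse` module, which is an assumption of this example and not something the RFC prescribes; that module follows a later URI specification, but its handling of "%" matches the behaviour described here. The sample data string is hypothetical.

```python
from urllib.parse import quote, unquote

# Literal data consisting of three characters: "%", "4", "1".
data = "%41"

# Escaping once turns the percent sign into "%25".
escaped_once = quote(data, safe="")
print(escaped_once)      # %2541

# Escaping an already escaped string escapes the "%" of "%25" again.
escaped_twice = quote(escaped_once, safe="")
print(escaped_twice)     # %252541

# Unescaping once recovers the original data.
unescaped_once = unquote(escaped_once)
print(unescaped_once)    # %41

# Unescaping a second time misreads the data "%41" as an escape for "A".
unescaped_twice = unquote(unescaped_once)
print(unescaped_twice)   # A
```

The last step is the failure mode the text warns about: the percent sign was data, but a second unescape treats it as an escape indicator.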
+ + space = <US-ASCII coded character 20 hexadecimal> + + The angle-bracket "<" and ">" and double-quote (") characters are + excluded because they are often used as the delimiters around URI in + text documents and protocol fields. The character "#" is excluded + because it is used to delimit a URI from a fragment identifier in URI + references (Section 4). The percent character "%" is excluded because + it is used for the encoding of escaped characters. + + delims = "<" | ">" | "#" | "%" | <"> + + + +Berners-Lee, et. al. Standards Track [Page 10] + +RFC 2396 URI Generic Syntax August 1998 + + + Other characters are excluded because gateways and other transport + agents are known to sometimes modify such characters, or they are + used as delimiters. + + unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`" + + Data corresponding to excluded characters must be escaped in order to + be properly represented within a URI. + +3. URI Syntactic Components + + The URI syntax is dependent upon the scheme. In general, absolute + URI are written as follows: + + <scheme>:<scheme-specific-part> + + An absolute URI contains the name of the scheme being used (<scheme>) + followed by a colon (":") and then a string (the <scheme-specific- + part>) whose interpretation depends on the scheme. + + The URI syntax does not require that the scheme-specific-part have + any general structure or set of semantics which is common among all + URI. However, a subset of URI do share a common syntax for + representing hierarchical relationships within the namespace. This + "generic URI" syntax consists of a sequence of four main components: + + <scheme>://<authority><path>?<query> + + each of which, except <scheme>, may be absent from a particular URI. + For example, some URI schemes do not allow an <authority> component, + and others do not use a <query> component. 
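As a rough illustration of the layout `<scheme>://<authority><path>?<query>`, the sketch below splits an absolute URI into those four components. It is a minimal sketch under stated assumptions, not a complete parser: the names `GenericURI` and `split_generic` are hypothetical, any fragment identifier is assumed to have been removed already, and a scheme-specific part that does not begin with "/" is returned unsplit.

```python
from typing import NamedTuple, Optional


class GenericURI(NamedTuple):
    scheme: str
    authority: Optional[str]  # None when the URI has no "//" authority part
    path: str
    query: Optional[str]      # None when the URI has no "?" query part


def split_generic(uri: str) -> GenericURI:
    """Split an absolute URI into its four main components (hypothetical helper)."""
    scheme, colon, rest = uri.partition(":")
    if not colon or not scheme:
        raise ValueError("an absolute URI must start with '<scheme>:'")

    # A scheme-specific part that does not start with "/" has no generic
    # structure here, so it is returned whole as the path.
    if not rest.startswith("/"):
        return GenericURI(scheme, None, rest, None)

    rest, qmark, query = rest.partition("?")

    authority = None
    if rest.startswith("//"):
        # The authority runs up to the next "/" or the end of the string.
        end = rest.find("/", 2)
        if end == -1:
            end = len(rest)
        authority, rest = rest[2:end], rest[end:]

    return GenericURI(scheme, authority, rest, query if qmark else None)


print(split_generic("http://www.math.uio.no/faq/compression-faq/part1.html"))
print(split_generic("mailto:mduerst@ifi.unizh.ch"))
```

Both inputs are taken from the example URI in Section 1.3; the second shows a URI in which only the scheme and an unstructured remainder are present.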
+ + absoluteURI = scheme ":" ( hier_part | opaque_part ) + + URI that are hierarchical in nature use the slash "/" character for + separating hierarchical components. For some file systems, a "/" + character (used to denote the hierarchical structure of a URI) is the + delimiter used to construct a file name hierarchy, and thus the URI + path will look similar to a file pathname. This does NOT imply that + the resource is a file or that the URI maps to an actual filesystem + pathname. + + hier_part = ( net_path | abs_path ) [ "?" query ] + + net_path = "//" authority [ abs_path ] + + abs_path = "/" path_segments + + + + +Berners-Lee, et. al. Standards Track [Page 11] + +RFC 2396 URI Generic Syntax August 1998 + + + URI that do not make use of the slash "/" character for separating + hierarchical components are considered opaque by the generic URI + parser. + + opaque_part = uric_no_slash *uric + + uric_no_slash = unreserved | escaped | ";" | "?" | ":" | "@" | + "&" | "=" | "+" | "$" | "," + + We use the term <path> to refer to both the <abs_path> and + <opaque_part> constructs, since they are mutually exclusive for any + given URI and can be parsed as a single component. + +3.1. Scheme Component + + Just as there are many different methods of access to resources, + there are a variety of schemes for identifying such resources. The + URI syntax consists of a sequence of components separated by reserved + characters, with the first component defining the semantics for the + remainder of the URI string. + + Scheme names consist of a sequence of characters beginning with a + lower case letter and followed by any combination of lower case + letters, digits, plus ("+"), period ("."), or hyphen ("-"). For + resiliency, programs interpreting URI should treat upper case letters + as equivalent to lower case in scheme names (e.g., allow "HTTP" as + well as "http"). + + scheme = alpha *( alpha | digit | "+" | "-" | "." 
) + + Relative URI references are distinguished from absolute URI in that + they do not begin with a scheme name. Instead, the scheme is + inherited from the base URI, as described in Section 5.2. + +3.2. Authority Component + + Many URI schemes include a top hierarchical element for a naming + authority, such that the namespace defined by the remainder of the + URI is governed by that authority. This authority component is + typically defined by an Internet-based server or a scheme-specific + registry of naming authorities. + + authority = server | reg_name + + The authority component is preceded by a double slash "//" and is + terminated by the next slash "/", question-mark "?", or by the end of + the URI. Within the authority component, the characters ";", ":", + "@", "?", and "/" are reserved. + + + +Berners-Lee, et. al. Standards Track [Page 12] + +RFC 2396 URI Generic Syntax August 1998 + + + An authority component is not required for a URI scheme to make use + of relative references. A base URI without an authority component + implies that any relative reference will also be without an authority + component. + +3.2.1. Registry-based Naming Authority + + The structure of a registry-based naming authority is specific to the + URI scheme, but constrained to the allowed characters for an + authority component. + + reg_name = 1*( unreserved | escaped | "$" | "," | + ";" | ":" | "@" | "&" | "=" | "+" ) + +3.2.2. Server-based Naming Authority + + URL schemes that involve the direct use of an IP-based protocol to a + specified server on the Internet use a common syntax for the server + component of the URI's scheme-specific data: + + <userinfo>@<host>:<port> + + where <userinfo> may consist of a user name and, optionally, scheme- + specific information about how to gain authorization to access the + server. The parts "<userinfo>@" and ":<port>" may be omitted. 
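The layout `<userinfo>@<host>:<port>` can be taken apart with two splits, as sketched below. The function name `split_server` and the user name in the example are hypothetical. The sketch assumes the input is already the server component on its own and does not validate the individual parts.

```python
from typing import Optional, Tuple


def split_server(server: str) -> Tuple[Optional[str], str, Optional[str]]:
    """Split '<userinfo>@<host>:<port>' into its parts (hypothetical helper).

    Parts that are omitted are returned as None.
    """
    # Split on "@" first: userinfo may itself contain ":" characters,
    # so the port separator must only be looked for in the host part.
    userinfo, at, hostport = server.rpartition("@")
    host, colon, port = hostport.partition(":")
    return (userinfo if at else None, host, port if colon else None)


print(split_server("melvyl.ucop.edu"))               # (None, 'melvyl.ucop.edu', None)
print(split_server("anonymous@ftp.is.co.za:21"))     # ('anonymous', 'ftp.is.co.za', '21')
```

The order of the two splits matters: looking for the colon before removing the userinfo would misread a userinfo such as "user:password" as a host and port.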
+ + server = [ [ userinfo "@" ] hostport ] + + The user information, if present, is followed by a commercial at-sign + "@". + + userinfo = *( unreserved | escaped | + ";" | ":" | "&" | "=" | "+" | "$" | "," ) + + Some URL schemes use the format "user:password" in the userinfo + field. This practice is NOT RECOMMENDED, because the passing of + authentication information in clear text (such as URI) has proven to + be a security risk in almost every case where it has been used. + + The host is a domain name of a network host, or its IPv4 address as a + set of four decimal digit groups separated by ".". Literal IPv6 + addresses are not supported. + + hostport = host [ ":" port ] + host = hostname | IPv4address + hostname = *( domainlabel "." ) toplabel [ "." ] + domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum + toplabel = alpha | alpha *( alphanum | "-" ) alphanum + + + +Berners-Lee, et. al. Standards Track [Page 13] + +RFC 2396 URI Generic Syntax August 1998 + + + IPv4address = 1*digit "." 1*digit "." 1*digit "." 1*digit + port = *digit + + Hostnames take the form described in Section 3 of [RFC1034] and + Section 2.1 of [RFC1123]: a sequence of domain labels separated by + ".", each domain label starting and ending with an alphanumeric + character and possibly also containing "-" characters. The rightmost + domain label of a fully qualified domain name will never start with a + digit, thus syntactically distinguishing domain names from IPv4 + addresses, and may be followed by a single "." if it is necessary to + distinguish between the complete domain name and any local domain. + To actually be "Uniform" as a resource locator, a URL hostname should + be a fully qualified domain name. In practice, however, the host + component may be a local domain literal. + + Note: A suitable representation for including a literal IPv6 + address as the host part of a URL is desired, but has not yet been + determined or implemented in practice. 
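The host rules above translate fairly directly into regular expressions. The following is a sketch of one possible transcription, with hypothetical names throughout; it checks syntax only and makes no attempt to range-check the decimal groups of an IPv4 address, since the grammar itself does not.

```python
import re

ALNUM = "[A-Za-z0-9]"

# domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
DOMAINLABEL = rf"{ALNUM}(?:(?:{ALNUM}|-)*{ALNUM})?"

# toplabel = alpha | alpha *( alphanum | "-" ) alphanum
TOPLABEL = rf"[A-Za-z](?:(?:{ALNUM}|-)*{ALNUM})?"

# hostname = *( domainlabel "." ) toplabel [ "." ]
HOSTNAME = re.compile(rf"(?:{DOMAINLABEL}\.)*{TOPLABEL}\.?")

# IPv4address = 1*digit "." 1*digit "." 1*digit "." 1*digit
IPV4ADDRESS = re.compile(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+")


def classify_host(host: str):
    """Return 'IPv4address', 'hostname', or None if neither rule matches."""
    if IPV4ADDRESS.fullmatch(host):
        return "IPv4address"
    if HOSTNAME.fullmatch(host):
        return "hostname"
    return None


print(classify_host("www.math.uio.no"))  # hostname
print(classify_host("127.0.0.1"))        # IPv4address
print(classify_host("-bad-.label"))      # None
```

Because a toplabel must begin with a letter, the two patterns cannot both match the same string, which is the syntactic distinction between domain names and IPv4 addresses that the text describes.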
+ + The port is the network port number for the server. Most schemes + designate protocols that have a default port number. Another port + number may optionally be supplied, in decimal, separated from the + host by a colon. If the port is omitted, the default port number is + assumed. + +3.3. Path Component + + The path component contains data, specific to the authority (or the + scheme if there is no authority component), identifying the resource + within the scope of that scheme and authority. + + path = [ abs_path | opaque_part ] + + path_segments = segment *( "/" segment ) + segment = *pchar *( ";" param ) + param = *pchar + + pchar = unreserved | escaped | + ":" | "@" | "&" | "=" | "+" | "$" | "," + + The path may consist of a sequence of path segments separated by a + single slash "/" character. Within a path segment, the characters + "/", ";", "=", and "?" are reserved. Each path segment may include a + sequence of parameters, indicated by the semicolon ";" character. + The parameters are not significant to the parsing of relative + references. + + + + + +Berners-Lee, et. al. Standards Track [Page 14] + +RFC 2396 URI Generic Syntax August 1998 + + +3.4. Query Component + + The query component is a string of information to be interpreted by + the resource. + + query = *uric + + Within a query component, the characters ";", "/", "?", ":", "@", + "&", "=", "+", ",", and "$" are reserved. + +4. URI References + + The term "URI-reference" is used here to denote the common usage of a + resource identifier. A URI reference may be absolute or relative, + and may have additional information attached in the form of a + fragment identifier. However, "the URI" that results from such a + reference includes only the absolute URI after the fragment + identifier (if any) is removed and after any relative URI is resolved + to its absolute form. 
Although it is possible to limit the + discussion of URI syntax and semantics to that of the absolute + result, most usage of URI is within general URI references, and it is + impossible to obtain the URI from such a reference without also + parsing the fragment and resolving the relative form. + + URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ] + + The syntax for relative URI is a shortened form of that for absolute + URI, where some prefix of the URI is missing and certain path + components ("." and "..") have a special meaning when, and only when, + interpreting a relative path. The relative URI syntax is defined in + Section 5. + +4.1. Fragment Identifier + + When a URI reference is used to perform a retrieval action on the + identified resource, the optional fragment identifier, separated from + the URI by a crosshatch ("#") character, consists of additional + reference information to be interpreted by the user agent after the + retrieval action has been successfully completed. As such, it is not + part of a URI, but is often used in conjunction with a URI. + + fragment = *uric + + The semantics of a fragment identifier is a property of the data + resulting from a retrieval action, regardless of the type of URI used + in the reference. Therefore, the format and interpretation of + fragment identifiers is dependent on the media type [RFC2046] of the + retrieval result. The character restrictions described in Section 2 + + + +Berners-Lee, et. al. Standards Track [Page 15] + +RFC 2396 URI Generic Syntax August 1998 + + + for URI also apply to the fragment in a URI-reference. Individual + media types may define additional restrictions or structure within + the fragment for specifying different types of "partial views" that + can be identified within that media type. 
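As a minimal sketch of the separation described above, the following Python splits a URI reference at the first crosshatch. The function name split_fragment is an assumption of this sketch. It returns None for a missing fragment so that an absent fragment stays distinct from an empty one.

```python
def split_fragment(reference):
    """Return (uri, fragment) for a URI reference.

    The fragment is None when no "#" is present and "" when the "#" is
    present but nothing follows it.
    """
    uri, sep, fragment = reference.partition("#")
    return uri, (fragment if sep else None)
```

A user agent would perform retrieval with the first element only, then interpret the second element according to the media type of the result.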
+ + A fragment identifier is only meaningful when a URI reference is + intended for retrieval and the result of that retrieval is a document + for which the identified fragment is consistently defined. + +4.2. Same-document References + + A URI reference that does not contain a URI is a reference to the + current document. In other words, an empty URI reference within a + document is interpreted as a reference to the start of that document, + and a reference containing only a fragment identifier is a reference + to the identified fragment of that document. Traversal of such a + reference should not result in an additional retrieval action. + However, if the URI reference occurs in a context that is always + intended to result in a new request, as in the case of HTML's FORM + element, then an empty URI reference represents the base URI of the + current document and should be replaced by that URI when transformed + into a request. + +4.3. Parsing a URI Reference + + A URI reference is typically parsed according to the four main + components and fragment identifier in order to determine what + components are present and whether the reference is relative or + absolute. The individual components are then parsed for their + subparts and, if not opaque, to verify their validity. + + Although the BNF defines what is allowed in each component, it is + ambiguous in terms of differentiating between an authority component + and a path component that begins with two slash characters. The + greedy algorithm is used for disambiguation: the left-most matching + rule soaks up as much of the URI reference string as it is capable of + matching. In other words, the authority component wins. + + Readers familiar with regular expressions should see Appendix B for a + concrete parsing example and test oracle. + +5. 
Relative URI References + + It is often the case that a group or "tree" of documents has been + constructed to serve a common purpose; the vast majority of URI in + these documents point to resources within the tree rather than + + + + + +Berners-Lee, et. al. Standards Track [Page 16] + +RFC 2396 URI Generic Syntax August 1998 + + + outside of it. Similarly, documents located at a particular site are + much more likely to refer to other resources at that site than to + resources at remote sites. + + Relative addressing of URI allows document trees to be partially + independent of their location and access scheme. For instance, it is + possible for a single set of hypertext documents to be simultaneously + accessible and traversable via each of the "file", "http", and "ftp" + schemes if the documents refer to each other using relative URI. + Furthermore, such document trees can be moved, as a whole, without + changing any of the relative references. Experience within the WWW + has demonstrated that the ability to perform relative referencing is + necessary for the long-term usability of embedded URI. + + The syntax for relative URI takes advantage of the <hier_part> syntax + of <absoluteURI> (Section 3) in order to express a reference that is + relative to the namespace of another hierarchical URI. + + relativeURI = ( net_path | abs_path | rel_path ) [ "?" query ] + + A relative reference beginning with two slash characters is termed a + network-path reference, as defined by <net_path> in Section 3. Such + references are rarely used. + + A relative reference beginning with a single slash character is + termed an absolute-path reference, as defined by <abs_path> in + Section 3. + + A relative reference that does not begin with a scheme name or a + slash character is termed a relative-path reference. 
+ + rel_path = rel_segment [ abs_path ] + + rel_segment = 1*( unreserved | escaped | + ";" | "@" | "&" | "=" | "+" | "$" | "," ) + + Within a relative-path reference, the complete path segments "." and + ".." have special meanings: "the current hierarchy level" and "the + level above this hierarchy level", respectively. Although this is + very similar to their use within Unix-based filesystems to indicate + directory levels, these path components are only considered special + when resolving a relative-path reference to its absolute form + (Section 5.2). + + Authors should be aware that a path segment which contains a colon + character cannot be used as the first segment of a relative URI path + (e.g., "this:that"), because it would be mistaken for a scheme name. + + + + +Berners-Lee, et. al. Standards Track [Page 17] + +RFC 2396 URI Generic Syntax August 1998 + + + It is therefore necessary to precede such segments with other + segments (e.g., "./this:that") in order for them to be referenced as + a relative path. + + It is not necessary for all URI within a given scheme to be + restricted to the <hier_part> syntax, since the hierarchical + properties of that syntax are only necessary when relative URI are + used within a particular document. Documents can only make use of + relative URI when their base URI fits within the <hier_part> syntax. + It is assumed that any document which contains a relative reference + will also have a base URI that obeys the syntax. In other words, + relative URI cannot be used within a document that has an unsuitable + base URI. + + Some URI schemes do not allow a hierarchical syntax matching the + <hier_part> syntax, and thus cannot use relative references. + +5.1. Establishing a Base URI + + The term "relative URI" implies that there exists some absolute "base + URI" against which the relative reference is applied. 
Indeed, the + base URI is necessary to define the semantics of any relative URI + reference; without it, a relative reference is meaningless. In order + for relative URI to be usable within a document, the base URI of that + document must be known to the parser. + + The base URI of a document can be established in one of four ways, + listed below in order of precedence. The order of precedence can be + thought of in terms of layers, where the innermost defined base URI + has the highest precedence. This can be visualized graphically as: + + .----------------------------------------------------------. + | .----------------------------------------------------. | + | | .----------------------------------------------. | | + | | | .----------------------------------------. | | | + | | | | .----------------------------------. | | | | + | | | | | <relative_reference> | | | | | + | | | | `----------------------------------' | | | | + | | | | (5.1.1) Base URI embedded in the | | | | + | | | | document's content | | | | + | | | `----------------------------------------' | | | + | | | (5.1.2) Base URI of the encapsulating entity | | | + | | | (message, document, or none). | | | + | | `----------------------------------------------' | | + | | (5.1.3) URI used to retrieve the entity | | + | `----------------------------------------------------' | + | (5.1.4) Default Base URI is application-dependent | + `----------------------------------------------------------' + + + +Berners-Lee, et. al. Standards Track [Page 18] + +RFC 2396 URI Generic Syntax August 1998 + + +5.1.1. Base URI within Document Content + + Within certain document media types, the base URI of the document can + be embedded within the content itself such that it can be readily + obtained by a parser. This can be useful for descriptive documents, + such as tables of content, which may be transmitted to others through + protocols other than their usual retrieval context (e.g., E-Mail or + USENET news). 
+ + It is beyond the scope of this document to specify how, for each + media type, the base URI can be embedded. It is assumed that user + agents manipulating such media types will be able to obtain the + appropriate syntax from that media type's specification. An example + of how the base URI can be embedded in the Hypertext Markup Language + (HTML) [RFC1866] is provided in Appendix D. + + A mechanism for embedding the base URI within MIME container types + (e.g., the message and multipart types) is defined by MHTML + [RFC2110]. Protocols that do not use the MIME message header syntax, + but which do allow some form of tagged metainformation to be included + within messages, may define their own syntax for defining the base + URI as part of a message. + +5.1.2. Base URI from the Encapsulating Entity + + If no base URI is embedded, the base URI of a document is defined by + the document's retrieval context. For a document that is enclosed + within another entity (such as a message or another document), the + retrieval context is that entity; thus, the default base URI of the + document is the base URI of the entity in which the document is + encapsulated. + +5.1.3. Base URI from the Retrieval URI + + If no base URI is embedded and the document is not encapsulated + within some other entity (e.g., the top level of a composite entity), + then, if a URI was used to retrieve the base document, that URI shall + be considered the base URI. Note that if the retrieval was the + result of a redirected request, the last URI used (i.e., that which + resulted in the actual retrieval of the document) is the base URI. + +5.1.4. Default Base URI + + If none of the conditions described in Sections 5.1.1--5.1.3 apply, + then the base URI is defined by the context of the application. + Since this definition is necessarily application-dependent, failing + + + + + +Berners-Lee, et. al. 
Standards Track [Page 19] + +RFC 2396 URI Generic Syntax August 1998 + + + to define the base URI using one of the other methods may result in + the same content being interpreted differently by different types of + application. + + It is the responsibility of the distributor(s) of a document + containing relative URI to ensure that the base URI for that document + can be established. It must be emphasized that relative URI cannot + be used reliably in situations where the document's base URI is not + well-defined. + +5.2. Resolving Relative References to Absolute Form + + This section describes an example algorithm for resolving URI + references that might be relative to a given base URI. + + The base URI is established according to the rules of Section 5.1 and + parsed into the four main components as described in Section 3. Note + that only the scheme component is required to be present in the base + URI; the other components may be empty or undefined. A component is + undefined if its preceding separator does not appear in the URI + reference; the path component is never undefined, though it may be + empty. The base URI's query component is not used by the resolution + algorithm and may be discarded. + + For each URI reference, the following steps are performed in order: + + 1) The URI reference is parsed into the potential four components and + fragment identifier, as described in Section 4.3. + + 2) If the path component is empty and the scheme, authority, and + query components are undefined, then it is a reference to the + current document and we are done. Otherwise, the reference URI's + query and fragment components are defined as found (or not found) + within the URI reference and not inherited from the base URI. + + 3) If the scheme component is defined, indicating that the reference + starts with a scheme name, then the reference is interpreted as an + absolute URI and we are done. 
Otherwise, the reference URI's + scheme is inherited from the base URI's scheme component. + + Due to a loophole in prior specifications [RFC1630], some parsers + allow the scheme name to be present in a relative URI if it is the + same as the base URI scheme. Unfortunately, this can conflict + with the correct parsing of non-hierarchical URI. For backwards + compatibility, an implementation may work around such references + by removing the scheme if it matches that of the base URI and the + scheme is known to always use the <hier_part> syntax. The parser + can then continue with the steps below for the remainder of the + reference components. Validating parsers should mark such a + malformed relative reference as an error. + + 4) If the authority component is defined, then the reference is a + network-path and we skip to step 7. Otherwise, the reference + URI's authority is inherited from the base URI's authority + component, which will also be undefined if the URI scheme does not + use an authority component. + + 5) If the path component begins with a slash character ("/"), then + the reference is an absolute-path and we skip to step 7. + + 6) If this step is reached, then we are resolving a relative-path + reference. The relative path needs to be merged with the base + URI's path. Although there are many ways to do this, we will + describe a simple method using a separate string buffer. + + a) All but the last segment of the base URI's path component is + copied to the buffer. In other words, any characters after the + last (right-most) slash character, if any, are excluded. + + b) The reference's path component is appended to the buffer + string. + + c) All occurrences of "./", where "." is a complete path segment, + are removed from the buffer string. + + d) If the buffer string ends with "." as a complete path segment, + that "." is removed.
+ + e) All occurrences of "<segment>/../", where <segment> is a + complete path segment not equal to "..", are removed from the + buffer string. Removal of these path segments is performed + iteratively, removing the leftmost matching pattern on each + iteration, until no matching pattern remains. + + f) If the buffer string ends with "<segment>/..", where <segment> + is a complete path segment not equal to "..", that + "<segment>/.." is removed. + + g) If the resulting buffer string still begins with one or more + complete path segments of "..", then the reference is + considered to be in error. Implementations may handle this + error by retaining these components in the resolved path (i.e., + treating them as part of the final URI), by removing them from + the resolved path (i.e., discarding relative levels above the + root), or by avoiding traversal of the reference. + + + +Berners-Lee, et. al. Standards Track [Page 21] + +RFC 2396 URI Generic Syntax August 1998 + + + h) The remaining buffer string is the reference URI's new path + component. + + 7) The resulting URI components, including any inherited from the + base URI, are recombined to give the absolute form of the URI + reference. Using pseudocode, this would be + + result = "" + + if scheme is defined then + append scheme to result + append ":" to result + + if authority is defined then + append "//" to result + append authority to result + + append path to result + + if query is defined then + append "?" to result + append query to result + + if fragment is defined then + append "#" to result + append fragment to result + + return result + + Note that we must be careful to preserve the distinction between a + component that is undefined, meaning that its separator was not + present in the reference, and a component that is empty, meaning + that the separator was present and was immediately followed by the + next component separator or the end of the reference. 
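The steps above can be sketched in Python as follows. This is one possible reading of the algorithm, not a required implementation. The function names (parse, merge_paths, recombine, resolve) are assumptions of this sketch. Step 1 uses the regular expression from Appendix B, step 6 works on a list of segments rather than on string patterns, and step 6g takes the option of retaining leading ".." segments. Returning None for a same-document reference is an arbitrary choice made here.

```python
import re

# Regular expression given in Appendix B of this document.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def parse(ref):
    """Step 1: return (scheme, authority, path, query, fragment).

    Undefined components are None; the path is never None.
    """
    m = URI_RE.match(ref)
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


def merge_paths(base_path, ref_path):
    """Step 6: merge a relative path with the base URI's path."""
    # a) and b): keep the base path up to its last "/", append the reference.
    buf = base_path[: base_path.rfind("/") + 1] + ref_path
    segs = buf.split("/")
    # A leading "/" yields an empty first element that is not a real segment.
    first = 1 if buf.startswith("/") else 0
    # c) remove "." segments that are followed by "/".
    segs = segs[:first] + [s for s in segs[first:-1] if s != "."] + segs[-1:]
    # d) a trailing "." segment is removed (the preceding "/" stays).
    if segs[-1] == ".":
        segs[-1] = ""
    # e) remove the leftmost "<segment>/../" until no match remains.
    i = first
    while i < len(segs) - 2:
        if segs[i] != ".." and segs[i + 1] == "..":
            del segs[i : i + 2]
            i = first  # restart so the leftmost match is always taken
        else:
            i += 1
    # f) remove a trailing "<segment>/..".
    if len(segs) >= first + 2 and segs[-1] == ".." and segs[-2] != "..":
        segs[-2:] = [""]
    # g) leading ".." segments are retained, one of the permitted choices.
    # h) the remaining buffer is the new path.
    return "/".join(segs)


def recombine(scheme, authority, path, query, fragment):
    """Step 7: rebuild the string, keeping undefined (None) distinct from empty."""
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += path
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result


def resolve(base, ref):
    """Resolve ref against base; None means "reference to the current document"."""
    b_scheme, b_authority, b_path, _, _ = parse(base)  # base query is unused
    scheme, authority, path, query, fragment = parse(ref)
    # Step 2: empty path with no scheme, authority, or query.
    if path == "" and scheme is None and authority is None and query is None:
        return None
    # Step 3: a defined scheme means the reference is already absolute.
    if scheme is not None:
        return ref
    scheme = b_scheme
    # Steps 4 to 6: inherit the authority, then merge unless the path is absolute.
    if authority is None:
        authority = b_authority
        if not path.startswith("/"):
            path = merge_paths(b_path, path)
    return recombine(scheme, authority, path, query, fragment)
```

With the base URI used in Appendix C, resolve("http://a/b/c/d;p?q", "../g") yields "http://a/b/g", and "http:g" is returned unchanged, which corresponds to the validating-parser behaviour rather than the backwards-compatible one.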
+ + The above algorithm is intended to provide an example by which the + output of implementations can be tested -- implementation of the + algorithm itself is not required. For example, some systems may find + it more efficient to implement step 6 as a pair of segment stacks + being merged, rather than as a series of string pattern replacements. + + Note: Some WWW client applications will fail to separate the + reference's query component from its path component before merging + the base and reference paths in step 6 above. This may result in + a loss of information if the query component contains the strings + "/../" or "/./". + + Resolution examples are provided in Appendix C. + + + +Berners-Lee, et. al. Standards Track [Page 22] + +RFC 2396 URI Generic Syntax August 1998 + + +6. URI Normalization and Equivalence + + In many cases, different URI strings may actually identify the + identical resource. For example, the host names used in URL are + actually case insensitive, and the URL <http://www.XEROX.com> is + equivalent to <http://www.xerox.com>. In general, the rules for + equivalence and definition of a normal form, if any, are scheme + dependent. When a scheme uses elements of the common syntax, it will + also use the common syntax equivalence rules, namely that the scheme + and hostname are case insensitive and a URL with an explicit ":port", + where the port is the default for the scheme, is equivalent to one + where the port is elided. + +7. Security Considerations + + A URI does not in itself pose a security threat. Users should beware + that there is no general guarantee that a URL, which at one time + located a given resource, will continue to do so. Nor is there any + guarantee that a URL will not locate a different resource at some + later point in time, due to the lack of any constraint on how a given + authority apportions its namespace. 
Such a guarantee can only be + obtained from the person(s) controlling that namespace and the + resource in question. A specific URI scheme may include additional + semantics, such as name persistence, if those semantics are required + of all naming authorities for that scheme. + + It is sometimes possible to construct a URL such that an attempt to + perform a seemingly harmless, idempotent operation, such as the + retrieval of an entity associated with the resource, will in fact + cause a possibly damaging remote operation to occur. The unsafe URL + is typically constructed by specifying a port number other than that + reserved for the network protocol in question. The client + unwittingly contacts a site that is in fact running a different + protocol. The content of the URL contains instructions that, when + interpreted according to this other protocol, cause an unexpected + operation. An example has been the use of a gopher URL to cause an + unintended or impersonating message to be sent via an SMTP server. + + Caution should be used when using any URL that specifies a port + number other than the default for the protocol, especially when it is + a number within the reserved space. + + Care should be taken when a URL contains escaped delimiters for a + given protocol (for example, CR and LF characters for telnet + protocols) that these are not unescaped before transmission. This + might violate the protocol, but avoids the potential for such + characters to be used to simulate an extra operation or parameter in + that protocol, which might cause an unexpected and possibly harmful + remote operation to be performed. + + It is clearly unwise to use a URL that contains a password which is + intended to be secret.
In particular, the use of a password within + the 'userinfo' component of a URL is strongly disrecommended except + in those rare cases where the 'password' parameter is intended to be + public. + +8. Acknowledgements + + This document was derived from RFC 1738 [RFC1738] and RFC 1808 + [RFC1808]; the acknowledgements in those specifications still apply. + In addition, contributions by Gisle Aas, Martin Beet, Martin Duerst, + Jim Gettys, Martijn Koster, Dave Kristol, Daniel LaLiberte, Foteos + Macrides, James Marshall, Ryan Moats, Keith Moore, and Lauren Wood + are gratefully acknowledged. + +9. References + + [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and + Languages", BCP 18, RFC 2277, January 1998. + + [RFC1630] Berners-Lee, T., "Universal Resource Identifiers in WWW: A + Unifying Syntax for the Expression of Names and Addresses + of Objects on the Network as used in the World-Wide Web", + RFC 1630, June 1994. + + [RFC1738] Berners-Lee, T., Masinter, L., and M. McCahill, Editors, + "Uniform Resource Locators (URL)", RFC 1738, December 1994. + + [RFC1866] Berners-Lee T., and D. Connolly, "HyperText Markup Language + Specification -- 2.0", RFC 1866, November 1995. + + [RFC1123] Braden, R., Editor, "Requirements for Internet Hosts -- + Application and Support", STD 3, RFC 1123, October 1989. + + [RFC822] Crocker, D., "Standard for the Format of ARPA Internet Text + Messages", STD 11, RFC 822, August 1982. + + [RFC1808] Fielding, R., "Relative Uniform Resource Locators", RFC + 1808, June 1995. + + [RFC2046] Freed, N., and N. Borenstein, "Multipurpose Internet Mail + Extensions (MIME) Part Two: Media Types", RFC 2046, + November 1996. + + + + +Berners-Lee, et. al. Standards Track [Page 24] + +RFC 2396 URI Generic Syntax August 1998 + + + [RFC1736] Kunze, J., "Functional Recommendations for Internet + Resource Locators", RFC 1736, February 1995. + + [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. 
+ + [RFC1034] Mockapetris, P., "Domain Names - Concepts and Facilities", + STD 13, RFC 1034, November 1987. + + [RFC2110] Palme, J., and A. Hopmann, "MIME E-mail Encapsulation of + Aggregate Documents, such as HTML (MHTML)", RFC 2110, March + 1997. + + [RFC1737] Sollins, K., and L. Masinter, "Functional Requirements for + Uniform Resource Names", RFC 1737, December 1994. + + [ASCII] US-ASCII. "Coded Character Set -- 7-bit American Standard + Code for Information Interchange", ANSI X3.4-1986. + + [UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO 10646", + RFC 2279, January 1998. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Berners-Lee, et. al. Standards Track [Page 25] + +RFC 2396 URI Generic Syntax August 1998 + + +10. Authors' Addresses + + Tim Berners-Lee + World Wide Web Consortium + MIT Laboratory for Computer Science, NE43-356 + 545 Technology Square + Cambridge, MA 02139 + + Fax: +1(617)258-8682 + EMail: timbl@w3.org + + + Roy T. Fielding + Department of Information and Computer Science + University of California, Irvine + Irvine, CA 92697-3425 + + Fax: +1(949)824-1715 + EMail: fielding@ics.uci.edu + + + Larry Masinter + Xerox PARC + 3333 Coyote Hill Road + Palo Alto, CA 94034 + + Fax: +1(415)812-4333 + EMail: masinter@parc.xerox.com + + + + + + + + + + + + + + + + + + + + + + + +Berners-Lee, et. al. Standards Track [Page 26] + +RFC 2396 URI Generic Syntax August 1998 + + +A. Collected BNF for URI + + URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ] + absoluteURI = scheme ":" ( hier_part | opaque_part ) + relativeURI = ( net_path | abs_path | rel_path ) [ "?" query ] + + hier_part = ( net_path | abs_path ) [ "?" query ] + opaque_part = uric_no_slash *uric + + uric_no_slash = unreserved | escaped | ";" | "?" 
| ":" | "@" | + "&" | "=" | "+" | "$" | "," + + net_path = "//" authority [ abs_path ] + abs_path = "/" path_segments + rel_path = rel_segment [ abs_path ] + + rel_segment = 1*( unreserved | escaped | + ";" | "@" | "&" | "=" | "+" | "$" | "," ) + + scheme = alpha *( alpha | digit | "+" | "-" | "." ) + + authority = server | reg_name + + reg_name = 1*( unreserved | escaped | "$" | "," | + ";" | ":" | "@" | "&" | "=" | "+" ) + + server = [ [ userinfo "@" ] hostport ] + userinfo = *( unreserved | escaped | + ";" | ":" | "&" | "=" | "+" | "$" | "," ) + + hostport = host [ ":" port ] + host = hostname | IPv4address + hostname = *( domainlabel "." ) toplabel [ "." ] + domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum + toplabel = alpha | alpha *( alphanum | "-" ) alphanum + IPv4address = 1*digit "." 1*digit "." 1*digit "." 1*digit + port = *digit + + path = [ abs_path | opaque_part ] + path_segments = segment *( "/" segment ) + segment = *pchar *( ";" param ) + param = *pchar + pchar = unreserved | escaped | + ":" | "@" | "&" | "=" | "+" | "$" | "," + + query = *uric + + fragment = *uric + + + +Berners-Lee, et. al. Standards Track [Page 27] + +RFC 2396 URI Generic Syntax August 1998 + + + uric = reserved | unreserved | escaped + reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | + "$" | "," + unreserved = alphanum | mark + mark = "-" | "_" | "." | "!" 
| "~" | "*" | "'" | + "(" | ")" + + escaped = "%" hex hex + hex = digit | "A" | "B" | "C" | "D" | "E" | "F" | + "a" | "b" | "c" | "d" | "e" | "f" + + alphanum = alpha | digit + alpha = lowalpha | upalpha + + lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | + "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | + "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z" + upalpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | + "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | + "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" + digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | + "8" | "9" + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Berners-Lee, et. al. Standards Track [Page 28] + +RFC 2396 URI Generic Syntax August 1998 + + +B. Parsing a URI Reference with a Regular Expression + + As described in Section 4.3, the generic URI syntax is not sufficient + to disambiguate the components of some forms of URI. Since the + "greedy algorithm" described in that section is identical to the + disambiguation method used by POSIX regular expressions, it is + natural and commonplace to use a regular expression for parsing the + potential four components and fragment identifier of a URI reference. + + The following line is the regular expression for breaking-down a URI + reference into its components. + + ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? + 12 3 4 5 6 7 8 9 + + The numbers in the second line above are only to assist readability; + they indicate the reference points for each subexpression (i.e., each + paired parenthesis). We refer to the value matched for subexpression + <n> as $<n>. 
For example, matching the above expression to + + http://www.ics.uci.edu/pub/ietf/uri/#Related + + results in the following subexpression matches: + + $1 = http: + $2 = http + $3 = //www.ics.uci.edu + $4 = www.ics.uci.edu + $5 = /pub/ietf/uri/ + $6 = <undefined> + $7 = <undefined> + $8 = #Related + $9 = Related + + where <undefined> indicates that the component is not present, as is + the case for the query component in the above example. Therefore, we + can determine the value of the four components and fragment as + + scheme = $2 + authority = $4 + path = $5 + query = $7 + fragment = $9 + + and, going in the opposite direction, we can recreate a URI reference + from its components using the algorithm in step 7 of Section 5.2. + + + + + +Berners-Lee, et. al. Standards Track [Page 29] + +RFC 2396 URI Generic Syntax August 1998 + + +C. Examples of Resolving Relative URI References + + Within an object with a well-defined base URI of + + http://a/b/c/d;p?q + + the relative URI would be resolved as follows: + +C.1. Normal Examples + + g:h = g:h + g = http://a/b/c/g + ./g = http://a/b/c/g + g/ = http://a/b/c/g/ + /g = http://a/g + //g = http://g + ?y = http://a/b/c/?y + g?y = http://a/b/c/g?y + #s = (current document)#s + g#s = http://a/b/c/g#s + g?y#s = http://a/b/c/g?y#s + ;x = http://a/b/c/;x + g;x = http://a/b/c/g;x + g;x?y#s = http://a/b/c/g;x?y#s + . = http://a/b/c/ + ./ = http://a/b/c/ + .. = http://a/b/ + ../ = http://a/b/ + ../g = http://a/b/g + ../.. = http://a/ + ../../ = http://a/ + ../../g = http://a/g + +C.2. Abnormal Examples + + Although the following abnormal examples are unlikely to occur in + normal practice, all URI parsers should be capable of resolving them + consistently. Each example uses the same base as above. + + An empty reference refers to the start of the current document. + + <> = (current document) + + Parsers must be careful in handling the case where there are more + relative path ".." 
segments than there are hierarchical levels in the + base URI's path. Note that the ".." syntax cannot be used to change + the authority component of a URI. + + + + +Berners-Lee, et. al. Standards Track [Page 30] + +RFC 2396 URI Generic Syntax August 1998 + + + ../../../g = http://a/../g + ../../../../g = http://a/../../g + + In practice, some implementations strip leading relative symbolic + elements (".", "..") after applying a relative URI calculation, based + on the theory that compensating for obvious author errors is better + than allowing the request to fail. Thus, the above two references + will be interpreted as "http://a/g" by some implementations. + + Similarly, parsers must avoid treating "." and ".." as special when + they are not complete components of a relative path. + + /./g = http://a/./g + /../g = http://a/../g + g. = http://a/b/c/g. + .g = http://a/b/c/.g + g.. = http://a/b/c/g.. + ..g = http://a/b/c/..g + + Less likely are cases where the relative URI uses unnecessary or + nonsensical forms of the "." and ".." complete path segments. + + ./../g = http://a/b/g + ./g/. = http://a/b/c/g/ + g/./h = http://a/b/c/g/h + g/../h = http://a/b/c/h + g;x=1/./y = http://a/b/c/g;x=1/y + g;x=1/../y = http://a/b/c/y + + All client applications remove the query component from the base URI + before resolving relative URI. However, some applications fail to + separate the reference's query and/or fragment components from a + relative path before merging it with the base path. This error is + rarely noticed, since typical usage of a fragment never includes the + hierarchy ("/") character, and the query component is not normally + used within relative references. + + g?y/./x = http://a/b/c/g?y/./x + g?y/../x = http://a/b/c/g?y/../x + g#s/./x = http://a/b/c/g#s/./x + g#s/../x = http://a/b/c/g#s/../x + + + + + + + + + + +Berners-Lee, et. al. 
Standards Track [Page 31] + +RFC 2396 URI Generic Syntax August 1998 + + + Some parsers allow the scheme name to be present in a relative URI if + it is the same as the base URI scheme. This is considered to be a + loophole in prior specifications of partial URI [RFC1630]. Its use + should be avoided. + + http:g = http:g ; for validating parsers + | http://a/b/c/g ; for backwards compatibility + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Berners-Lee, et. al. Standards Track [Page 32] + +RFC 2396 URI Generic Syntax August 1998 + + +D. Embedding the Base URI in HTML documents + + It is useful to consider an example of how the base URI of a document + can be embedded within the document's content. In this appendix, we + describe how documents written in the Hypertext Markup Language + (HTML) [RFC1866] can include an embedded base URI. This appendix + does not form a part of the URI specification and should not be + considered as anything more than a descriptive example. + + HTML defines a special element "BASE" which, when present in the + "HEAD" portion of a document, signals that the parser should use the + BASE element's "HREF" attribute as the base URI for resolving any + relative URI. The "HREF" attribute must be an absolute URI. Note + that, in HTML, element and attribute names are case-insensitive. For + example: + + <!doctype html public "-//IETF//DTD HTML//EN"> + <HTML><HEAD> + <TITLE>An example HTML document</TITLE> + <BASE href="http://www.ics.uci.edu/Test/a/b/c"> + </HEAD><BODY> + ... <A href="../x">a hypertext anchor</A> ... + </BODY></HTML> + + A parser reading the example document should interpret the given + relative URI "../x" as representing the absolute URI + + <http://www.ics.uci.edu/Test/a/x> + + regardless of the context in which the example document was obtained. + + + + + + + + + + + + + + + + + + + + + +Berners-Lee, et. al. Standards Track [Page 33] + +RFC 2396 URI Generic Syntax August 1998 + + +E. 
Recommendations for Delimiting URI in Context
+
+   URI are often transmitted through formats that do not provide a clear
+   context for their interpretation. For example, there are many
+   occasions when URI are included in plain text; examples include text
+   sent in electronic mail, USENET news messages, and, most importantly,
+   printed on paper. In such cases, it is important to be able to
+   delimit the URI from the rest of the text, and in particular from
+   punctuation marks that might be mistaken for part of the URI.
+
+   In practice, URI are delimited in a variety of ways, but usually
+   within double-quotes "http://test.com/", angle brackets
+   <http://test.com/>, or just using whitespace
+
+      http://test.com/
+
+   These wrappers do not form part of the URI.
+
+   In the case where a fragment identifier is associated with a URI
+   reference, the fragment would be placed within the brackets as well
+   (separated from the URI with a "#" character).
+
+   In some cases, extra whitespace (spaces, linebreaks, tabs, etc.) may
+   need to be added to break long URI across lines. The whitespace
+   should be ignored when extracting the URI.
+
+   No whitespace should be introduced after a hyphen ("-") character.
+   Because some typesetters and printers may (erroneously) introduce a
+   hyphen at the end of line when breaking a line, the interpreter of a
+   URI containing a line break immediately after a hyphen should ignore
+   all unescaped whitespace around the line break, and should be aware
+   that the hyphen may or may not actually be part of the URI.
+
+   Using <> angle brackets around each URI is especially recommended as
+   a delimiting style for URI that contain whitespace.
+
+   The prefix "URL:" (with or without a trailing space) was recommended
+   as a way to help distinguish a URL from other bracketed
+   designators, although this is not common in practice.
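The delimiting conventions described above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the specification: the function names `clean_uri` and `extract_delimited` are hypothetical, the code handles only the double-quote and angle-bracket wrappers, and it assumes that a wrapped span is a URI reference only when it starts with a scheme name followed by ":". Relative references, bare whitespace-delimited URI, and the hyphen-at-line-break ambiguity are deliberately left out.

```python
import re
from typing import List

# Spans wrapped in <...> or "..." (the two wrapper styles named in the text).
_WRAPPED = re.compile(r'<([^<>]*)>|"([^"]*)"')

# Scheme syntax: an alpha character followed by alpha, digit, "+", "-" or ".".
_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")


def clean_uri(raw: str) -> str:
    """Drop an optional "URL:" prefix and all embedded whitespace."""
    text = raw.strip()
    if text[:4].upper() == "URL:":
        text = text[4:]
    # Whitespace added to break a long URI across lines is ignored.
    return re.sub(r"\s+", "", text)


def extract_delimited(text: str) -> List[str]:
    """Return the wrapped spans in `text` that look like absolute URI."""
    found = []
    for match in _WRAPPED.finditer(text):
        body = match.group(1) if match.group(1) is not None else match.group(2)
        candidate = clean_uri(body)
        # Assumption of this sketch: keep only spans that begin with a scheme.
        if _SCHEME.match(candidate):
            found.append(candidate)
    return found
```

Requiring a scheme keeps ordinary quoted prose out of the result, at the cost of ignoring wrapped relative references; a caller that knows its input contains only URI references could drop that check.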
+
+   For robustness, software that accepts user-typed URI should attempt
+   to recognize and strip both delimiters and embedded whitespace.
+
+   For example, the text:
+
+      Yes, Jim, I found it under "http://www.w3.org/Addressing/",
+      but you can probably pick it up from <ftp://ds.internic.
+      net/rfc/>. Note the warning in <http://www.ics.uci.edu/pub/
+      ietf/uri/historical.html#WARNING>.
+
+   contains the URI references
+
+      http://www.w3.org/Addressing/
+      ftp://ds.internic.net/rfc/
+      http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING
+
+F. Abbreviated URLs
+
+   The URL syntax was designed for unambiguous reference to network
+   resources and extensibility via the URL scheme. However, as URL
+   identification and usage have become commonplace, traditional media
+   (television, radio, newspapers, billboards, etc.) have increasingly
+   used abbreviated URL references. That is, a reference consisting of
+   only the authority and path portions of the identified resource, such
+   as
+
+      www.w3.org/Addressing/
+
+   or simply the DNS hostname on its own. Such references are primarily
+   intended for human interpretation rather than machine, with the
+   assumption that context-based heuristics are sufficient to complete
+   the URL (e.g., most hostnames beginning with "www" are likely to have
+   a URL prefix of "http://"). Although there is no standard set of
+   heuristics for disambiguating abbreviated URL references, many client
+   implementations allow them to be entered by the user and
+   heuristically resolved. It should be noted that such heuristics may
+   change over time, particularly when new URL schemes are introduced.
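As a rough illustration of such a heuristic, the Python sketch below completes an abbreviated reference from its hostname prefix. Since the text states that no standard set of heuristics exists, everything here is an assumption made for the example: the function name `complete_abbreviated` is hypothetical, the "www." rule follows the example given above, and the "ftp." rule is an added guess rather than something this document describes.

```python
import re

# A reference that already starts with "scheme:" is left untouched.
_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")

# Hostname prefix -> guessed URL prefix.  Only the "www." entry comes from
# the text; the "ftp." entry is an assumption of this sketch.
_GUESSES = {
    "www.": "http://",
    "ftp.": "ftp://",
}


def complete_abbreviated(ref: str) -> str:
    """Guess a full URL for an abbreviated reference such as www.w3.org/x."""
    ref = ref.strip()
    if _SCHEME.match(ref):
        # Caveat: "host:port/path" also looks like "scheme:" and is therefore
        # returned unchanged; resolving that ambiguity needs more context.
        return ref
    host = ref.split("/", 1)[0].lower()
    for prefix, url_prefix in _GUESSES.items():
        if host.startswith(prefix):
            return url_prefix + ref
    # No rule applies: return the input rather than inventing a scheme.
    return ref
```

A table of rules keeps the guesses easy to revise, which matters because heuristics of this kind are expected to change as new URL schemes appear.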
+
+   Since an abbreviated URL has the same syntax as a relative URL path,
+   abbreviated URL references cannot be used in contexts where relative
+   URLs are expected. This limits the use of abbreviated URLs to places
+   where there is no defined base URL, such as dialog boxes and off-line
+   advertisements.
+
+G. Summary of Non-editorial Changes
+
+G.1. Additions
+
+   Section 4 (URI References) was added to stem the confusion regarding
+   "what is a URI" and how to describe fragment identifiers given that
+   they are not part of the URI, but are part of the URI syntax and
+   parsing concerns. In addition, it provides a reference definition
+   for use by other IETF specifications (HTML, HTTP, etc.) that have
+   previously attempted to redefine the URI syntax in order to account
+   for the presence of fragment identifiers in URI references.
+
+   Section 2.4 was rewritten to clarify a number of misinterpretations
+   and to leave room for fully internationalized URI.
+
+   Appendix F on abbreviated URLs was added to describe the shortened
+   references often seen on television and magazine advertisements and
+   explain why they are not used in other contexts.
+
+G.2. Modifications from both RFC 1738 and RFC 1808
+
+   Changed to URI syntax instead of just URL.
+
+   Confusion regarding the terms "character encoding", the URI
+   "character set", and the escaping of characters with %<hex><hex>
+   equivalents has (hopefully) been reduced. Many of the BNF rule names
+   regarding the character sets have been changed to more accurately
+   describe their purpose and to encompass all "characters" rather than
+   just US-ASCII octets. Unless otherwise noted here, these
+   modifications do not affect the URI syntax.
+
+   Both RFC 1738 and RFC 1808 refer to the "reserved" set of characters
+   as if URI-interpreting software were limited to a single set of
+   characters with a reserved purpose (i.e., as meaning something other
+   than the data to which the characters correspond), and that this set
+   was fixed by the URI scheme. However, this has not been true in
+   practice; any character that is interpreted differently when it is
+   escaped is, in effect, reserved. Furthermore, the interpreting
+   engine on an HTTP server is often dependent on the resource, not just
+   the URI scheme. The description of reserved characters has been
+   changed accordingly.
+
+   The plus "+", dollar "$", and comma "," characters have been added to
+   those in the "reserved" set, since they are treated as reserved
+   within the query component.
+
+   The tilde "~" character was added to those in the "unreserved" set,
+   since it is extensively used on the Internet despite being
+   difficult to transcribe on some keyboards.
+
+   The syntax for URI scheme has been changed to require that all
+   schemes begin with an alpha character.
+
+   The "user:password" form in the previous BNF was changed to a
+   "userinfo" token, and the possibility that it might be
+   "user:password" made scheme specific. In particular, the use of
+   passwords in the clear is not even suggested by the syntax.
+
+   The question-mark "?" character was removed from the set of allowed
+   characters for the userinfo in the authority component, since testing
+   showed that many applications treat it as reserved for separating the
+   query component from the rest of the URI.
+
+   The semicolon ";" character was added to those stated as being
+   reserved within the authority component, since several new schemes
+   are using it as a separator within userinfo to indicate the type of
+   user authentication.
+
+   RFC 1738 specified that the path was separated from the authority
+   portion of a URI by a slash. RFC 1808 followed suit, but with a
+   fudge of carrying around the separator as a "prefix" in order to
+   describe the parsing algorithm. RFC 1630 never had this problem,
+   since it considered the slash to be part of the path. In writing
+   this specification, it was found to be impossible to accurately
+   describe and retain the difference between the two URI
+      <foo:/bar>    and    <foo:bar>
+   without either considering the slash to be part of the path (as
+   corresponds to actual practice) or creating a separate component just
+   to hold that slash. We chose the former.
+
+G.3. Modifications from RFC 1738
+
+   The definition of specific URL schemes and their scheme-specific
+   syntax and semantics has been moved to separate documents.
+
+   The URL host was defined as a fully-qualified domain name. However,
+   many URLs are used without fully-qualified domain names (in contexts
+   for which the full qualification is not necessary), without any host
+   (as in some file URLs), or with a host of "localhost".
+
+   The URL port is now *digit instead of 1*digit, since systems are
+   expected to handle the case where the ":" separator between host and
+   port is supplied without a port.
+
+   The recommendations for delimiting URI in context (Appendix E) have
+   been adjusted to reflect current practice.
+
+G.4. Modifications from RFC 1808
+
+   RFC 1808 (Section 4) defined an empty URL reference (a reference
+   containing nothing aside from the fragment identifier) as being a
+   reference to the base URL. Unfortunately, that definition could be
+   interpreted, upon selection of such a reference, as a new retrieval
+   action on that resource.
Since the normal intent of such references
+   is for the user agent to change its view of the current document to
+   the beginning of the specified fragment within that document, not to
+   make an additional request of the resource, a description of how to
+   correctly interpret an empty reference has been added in Section 4.
+
+   The description of the mythical Base header field has been replaced
+   with a reference to the Content-Location header field defined by
+   MHTML [RFC2110].
+
+   RFC 1808 described various schemes as either having or not having the
+   properties of the generic URI syntax. However, the only requirement
+   is that the particular document containing the relative references
+   have a base URI that abides by the generic URI syntax, regardless of
+   the URI scheme, so the associated description has been updated to
+   reflect that.
+
+   The BNF term <net_loc> has been replaced with <authority>, since the
+   latter more accurately describes its use and purpose. Likewise, the
+   authority is no longer restricted to the IP server syntax.
+
+   Extensive testing of current client applications demonstrated that
+   the majority of deployed systems do not use the ";" character to
+   indicate trailing parameter information, and that the presence of a
+   semicolon in a path segment does not affect the relative parsing of
+   that segment. Therefore, parameters have been removed as a separate
+   component and may now appear in any path segment. Their influence
+   has been removed from the algorithm for resolving a relative URI
+   reference. The resolution examples in Appendix C have been modified
+   to reflect this change.
+
+   Implementations are now allowed to work around malformed relative
+   references that are prefixed by the same scheme as the base URI, but
+   only for schemes known to use the <hier_part> syntax.
+
+H.
Full Copyright Statement
+
+   Copyright (C) The Internet Society (1998). All Rights Reserved.
+
+   This document and translations of it may be copied and furnished to
+   others, and derivative works that comment on or otherwise explain it
+   or assist in its implementation may be prepared, copied, published
+   and distributed, in whole or in part, without restriction of any
+   kind, provided that the above copyright notice and this paragraph are
+   included on all such copies and derivative works. However, this
+   document itself may not be modified in any way, such as by removing
+   the copyright notice or references to the Internet Society or other
+   Internet organizations, except as needed for the purpose of
+   developing Internet standards in which case the procedures for
+   copyrights defined in the Internet Standards process must be
+   followed, or as required to translate it into languages other than
+   English.
+
+   The limited permissions granted above are perpetual and will not be
+   revoked by the Internet Society or its successors or assigns.
+
+   This document and the information contained herein is provided on an
+   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
+   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
+   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
+   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
+   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.