summaryrefslogtreecommitdiff
path: root/doc/rfc/rfc9485.txt
diff options
context:
space:
mode:
authorThomas Voss <mail@thomasvoss.com> 2024-11-27 20:54:24 +0100
committerThomas Voss <mail@thomasvoss.com> 2024-11-27 20:54:24 +0100
commit4bfd864f10b68b71482b35c818559068ef8d5797 (patch)
treee3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc9485.txt
parentea76e11061bda059ae9f9ad130a9895cc85607db (diff)
doc: Add RFC documents
Diffstat (limited to 'doc/rfc/rfc9485.txt')
-rw-r--r--doc/rfc/rfc9485.txt486
1 files changed, 486 insertions, 0 deletions
diff --git a/doc/rfc/rfc9485.txt b/doc/rfc/rfc9485.txt
new file mode 100644
index 0000000..157ff1e
--- /dev/null
+++ b/doc/rfc/rfc9485.txt
@@ -0,0 +1,486 @@
+
+
+
+
+Internet Engineering Task Force (IETF) C. Bormann
+Request for Comments: 9485 Universität Bremen TZI
+Category: Standards Track T. Bray
+ISSN: 2070-1721 Textuality
+ October 2023
+
+
+ I-Regexp: An Interoperable Regular Expression Format
+
+Abstract
+
+ This document specifies I-Regexp, a flavor of regular expression that
+ is limited in scope with the goal of interoperation across many
+ different regular expression libraries.
+
+Status of This Memo
+
+ This is an Internet Standards Track document.
+
+ This document is a product of the Internet Engineering Task Force
+ (IETF). It represents the consensus of the IETF community. It has
+ received public review and has been approved for publication by the
+ Internet Engineering Steering Group (IESG). Further information on
+ Internet Standards is available in Section 2 of RFC 7841.
+
+ Information about the current status of this document, any errata,
+ and how to provide feedback on it may be obtained at
+ https://www.rfc-editor.org/info/rfc9485.
+
+Copyright Notice
+
+ Copyright (c) 2023 IETF Trust and the persons identified as the
+ document authors. All rights reserved.
+
+ This document is subject to BCP 78 and the IETF Trust's Legal
+ Provisions Relating to IETF Documents
+ (https://trustee.ietf.org/license-info) in effect on the date of
+ publication of this document. Please review these documents
+ carefully, as they describe your rights and restrictions with respect
+ to this document. Code Components extracted from this document must
+ include Revised BSD License text as described in Section 4.e of the
+ Trust Legal Provisions and are provided without warranty as described
+ in the Revised BSD License.
+
+Table of Contents
+
+ 1. Introduction
+ 1.1. Terminology
+ 2. Objectives
+ 3. I-Regexp Syntax
+ 3.1. Checking Implementations
+ 4. I-Regexp Semantics
+ 5. Mapping I-Regexp to Regexp Dialects
+ 5.1. Multi-Character Escapes
+ 5.2. XSD Regexps
+ 5.3. ECMAScript Regexps
+ 5.4. PCRE, RE2, and Ruby Regexps
+ 6. Motivation and Background
+ 6.1. Implementing I-Regexp
+ 7. IANA Considerations
+ 8. Security Considerations
+ 9. References
+ 9.1. Normative References
+ 9.2. Informative References
+ Acknowledgements
+ Authors' Addresses
+
+1. Introduction
+
+ This specification describes an interoperable regular expression
+ (abbreviated as "regexp") flavor, I-Regexp.
+
+ I-Regexp does not provide advanced regular expression features such
+ as capture groups, lookahead, or backreferences. It supports only a
+ Boolean matching capability, i.e., testing whether a given regular
+ expression matches a given piece of text.
+
+ I-Regexp supports the entire repertoire of Unicode characters
+ (Unicode scalar values); both the I-Regexp strings themselves and the
+ strings they are matched against are sequences of Unicode scalar
+ values (often represented in UTF-8 encoding form [STD63] for
+ interchange).
+
+ I-Regexp is a subset of XML Schema Definition (XSD) regular
+ expressions [XSD-2].
+
+ This document includes guidance for converting I-Regexps for use with
+ several well-known regular expression idioms.
+
+ The development of I-Regexp was motivated by the work of the JSONPath
+ Working Group (WG). The WG wanted to include support for the use of
+ regular expressions in JSONPath filters in its specification
+ [JSONPATH-BASE], but was unable to find a useful specification for
+ regular expressions that would be interoperable across the popular
+ libraries.
+
+1.1. Terminology
+
+ This document uses the abbreviation "regexp" for what is usually
+ called a "regular expression" in programming. The term "I-Regexp" is
+ used as a noun meaning a character string (sequence of Unicode scalar
+ values) that conforms to the requirements in this specification; the
+ plural is "I-Regexps".
+
+ This specification uses Unicode terminology; a good entry point is
+ provided by [UNICODE-GLOSSARY].
+
+ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+ "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
+ "OPTIONAL" in this document are to be interpreted as described in
+ BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
+ capitals, as shown here.
+
+ The grammatical rules in this document are to be interpreted as ABNF,
+ as described in [RFC5234] and [RFC7405], where the "characters" of
+ Section 2.3 of [RFC5234] are Unicode scalar values.
+
+2. Objectives
+
+ I-Regexps should handle the vast majority of practical cases where a
+ matching regexp is needed in a data-model specification or a query-
+ language expression.
+
+ At the time of writing, an editor of this document conducted a survey
+ of the regexp syntax used in recently published RFCs. All examples
+ found there should be covered by I-Regexps, both syntactically and
+ with their intended semantics. The exception is the use of multi-
+ character escapes, for which workaround guidance is provided in
+ Section 5.
+
+3. I-Regexp Syntax
+
+ An I-Regexp MUST conform to the ABNF specification in Figure 1.
+
+ i-regexp = branch *( "|" branch )
+ branch = *piece
+ piece = atom [ quantifier ]
+ quantifier = ( "*" / "+" / "?" ) / range-quantifier
+ range-quantifier = "{" QuantExact [ "," [ QuantExact ] ] "}"
+ QuantExact = 1*%x30-39 ; '0'-'9'
+
+ atom = NormalChar / charClass / ( "(" i-regexp ")" )
+ NormalChar = ( %x00-27 / "," / "-" / %x2F-3E ; '/'-'>'
+ / %x40-5A ; '@'-'Z'
+ / %x5E-7A ; '^'-'z'
+ / %x7E-D7FF ; skip surrogate code points
+ / %xE000-10FFFF )
+ charClass = "." / SingleCharEsc / charClassEsc / charClassExpr
+ SingleCharEsc = "\" ( %x28-2B ; '('-'+'
+ / "-" / "." / "?" / %x5B-5E ; '['-'^'
+ / %s"n" / %s"r" / %s"t" / %x7B-7D ; '{'-'}'
+ )
+ charClassEsc = catEsc / complEsc
+ charClassExpr = "[" [ "^" ] ( "-" / CCE1 ) *CCE1 [ "-" ] "]"
+ CCE1 = ( CCchar [ "-" CCchar ] ) / charClassEsc
+ CCchar = ( %x00-2C / %x2E-5A ; '.'-'Z'
+ / %x5E-D7FF ; skip surrogate code points
+ / %xE000-10FFFF ) / SingleCharEsc
+ catEsc = %s"\p{" charProp "}"
+ complEsc = %s"\P{" charProp "}"
+ charProp = IsCategory
+ IsCategory = Letters / Marks / Numbers / Punctuation / Separators /
+ Symbols / Others
+ Letters = %s"L" [ ( %s"l" / %s"m" / %s"o" / %s"t" / %s"u" ) ]
+ Marks = %s"M" [ ( %s"c" / %s"e" / %s"n" ) ]
+ Numbers = %s"N" [ ( %s"d" / %s"l" / %s"o" ) ]
+ Punctuation = %s"P" [ ( %x63-66 ; 'c'-'f'
+ / %s"i" / %s"o" / %s"s" ) ]
+ Separators = %s"Z" [ ( %s"l" / %s"p" / %s"s" ) ]
+ Symbols = %s"S" [ ( %s"c" / %s"k" / %s"m" / %s"o" ) ]
+ Others = %s"C" [ ( %s"c" / %s"f" / %s"n" / %s"o" ) ]
+
+ Figure 1: I-Regexp Syntax in ABNF
+
+ As an additional restriction, charClassExpr is not allowed to match
+ [^], which, according to this grammar, would parse as a positive
+ character class containing the single character ^.
+
+ This is essentially an XSD regexp without:
+
+ * character class subtraction,
+
+ * multi-character escapes such as \s, \S, and \w, and
+
+ * Unicode blocks.
+
+ An I-Regexp implementation MUST be a complete implementation of this
+ limited subset. In particular, full support for the Unicode
+ functionality defined in this specification is REQUIRED. The
+ implementation:
+
+ * MUST NOT limit itself to 7- or 8-bit character sets such as ASCII,
+ and
+
+ * MUST support the Unicode character property set in character
+ classes.
+
+3.1. Checking Implementations
+
+ A _checking_ I-Regexp implementation is one that checks a supplied
+ regexp for compliance with this specification and reports any
+ problems. Checking implementations give their users confidence that
+ they didn't accidentally insert syntax that is not interoperable, so
+ checking is RECOMMENDED. Exceptions to this rule may be made for
+ low-effort implementations that map I-Regexp to another regexp
+ library by simple steps such as performing the mapping operations
+ discussed in Section 5. Here, the effort needed to do full checking
+ might dwarf the rest of the implementation effort. Implementations
+ SHOULD document whether or not they are checking.
+
+ Specifications that employ I-Regexp may want to define in which cases
+ their implementations can work with a non-checking I-Regexp
+ implementation and when full checking is needed, possibly in the
+ process of defining their own implementation classes.
+
+4. I-Regexp Semantics
+
+ This syntax is a subset of that of [XSD-2]. Implementations that
+ interpret I-Regexps MUST yield Boolean results as specified in
+ [XSD-2]. (See also Section 5.2.)
+
+5. Mapping I-Regexp to Regexp Dialects
+
+ The material in this section is not normative; it is provided as
+ guidance to developers who want to use I-Regexps in the context of
+ other regular expression dialects.
+
+5.1. Multi-Character Escapes
+
+ I-Regexp does not support common multi-character escapes (MCEs) and
+ character classes built around them. These can usually be replaced
+ as shown by the examples in Table 1.
+
+ +============+===============+
+ | MCE/class: | Replace with: |
+ +============+===============+
+ | \S | [^ \t\n\r] |
+ +------------+---------------+
+ | [\S ] | [^\t\n\r] |
+ +------------+---------------+
+ | \d | [0-9] |
+ +------------+---------------+
+
+ Table 1: Example
+ Substitutes for Multi-
+ Character Escapes
+
+ Note that the semantics of \d in XSD regular expressions is that of
+ \p{Nd}; however, this would include all Unicode characters that are
+ digits in various writing systems, which is almost certainly not what
+ is required in IETF publications.
+
+ The construct \p{IsBasicLatin} is essentially a reference to legacy
+ ASCII; it can be replaced by the character class [\u0000-\u007f].
+
+5.2. XSD Regexps
+
+ Any I-Regexp is also an XSD regexp [XSD-2], so the mapping is an
+ identity function.
+
+ Note that a few errata for [XSD-2] have been fixed in [XSD-1.1-2];
+ therefore, it is also included in the Normative References
+ (Section 9.1). XSD 1.1 is less widely implemented than XSD 1.0, and
+ implementations of XSD 1.0 are likely to include these bugfixes; for
+ the intents and purposes of this specification, an implementation of
+ XSD 1.0 regexps is equivalent to an implementation of XSD 1.1
+ regexps.
+
+5.3. ECMAScript Regexps
+
+ Perform the following steps on an I-Regexp to obtain an ECMAScript
+ regexp [ECMA-262]:
+
+ * For any unescaped dots (.) outside character classes (first
+ alternative of charClass production), replace the dot with
+ [^\n\r].
+
+ * Envelope the result in ^(?: and )$.
+
+ The ECMAScript regexp is to be interpreted as a Unicode pattern ("u"
+ flag; see Section 21.2.2 "Pattern Semantics" of [ECMA-262]).
+
+ Note that where a regexp literal is required, the actual regexp needs
+ to be enclosed in /.
+
+5.4. PCRE, RE2, and Ruby Regexps
+
+ To obtain a valid regexp in Perl Compatible Regular Expressions
+ (PCRE) [PCRE2], the Go programming language's RE2 regexp library
+ [RE2], and the Ruby programming language, perform the same steps as
+ in Section 5.3, except that the last step is:
+
+ * Enclose the regexp in \A(?: and )\z.
+
+6. Motivation and Background
+
+ While regular expressions originally were intended to describe a
+ formal language to support a Boolean matching function, they have
+ been enhanced with parsing functions that support the extraction and
+ replacement of arbitrary portions of the matched text. With this
+ accretion of features, parsing-regexp libraries have become more
+ susceptible to bugs and surprising performance degradations that can
+ be exploited in denial-of-service attacks by an attacker who controls
+ the regexp submitted for processing. I-Regexp is designed to offer
+ interoperability and to be less vulnerable to such attacks, with the
+ trade-off that its only function is to offer a Boolean response as to
+ whether a character sequence is matched by a regexp.
+
+6.1. Implementing I-Regexp
+
+ XSD regexps are relatively easy to implement or map to widely
+ implemented parsing-regexp dialects, with these notable exceptions:
+
+ * Character class subtraction. This is a very useful feature in
+ many specifications, but it is unfortunately mostly absent from
+ parsing-regexp dialects. Thus, it is omitted from I-Regexp.
+
+ * Multi-character escapes. \d, \w, \s and their uppercase
+ complement classes exhibit a large amount of variation between
+ regexp flavors. Thus, they are omitted from I-Regexp.
+
+ * Not all regexp implementations support access to Unicode tables
+ that enable executing constructs such as \p{Nd}, although the
+ \p/\P feature in general is now quite widely available. While, in
+ principle, it is possible to translate these into character-class
+ matches, this also requires access to those tables. Thus, regexp
+ libraries in severely constrained environments may not be able to
+ support I-Regexp conformance.
+
+7. IANA Considerations
+
+ This document has no IANA actions.
+
+8. Security Considerations
+
+ While technically out of the scope of this specification, Section 10
+ ("Security Considerations") of RFC 3629 [STD63] applies to
+ implementations. Particular note needs to be taken of the last
+ paragraph of Section 3 ("UTF-8 definition") of RFC 3629 [STD63]; an
+ I-Regexp implementation may need to mitigate limitations of the
+ platform implementation in this regard.
+
+ As discussed in Section 6, more complex regexp libraries may contain
+ exploitable bugs, which can lead to crashes and remote code
+ execution. There is also the problem that such libraries often have
+ performance characteristics that are hard to predict, leading to
+ attacks that overload an implementation by matching against an
+ expensive attacker-controlled regexp.
+
+ I-Regexps have been designed to allow implementation in a way that is
+ resilient to both threats; this objective needs to be addressed
+ throughout the implementation effort. Non-checking implementations
+ (see Section 3.1) are likely to expose security limitations of any
+ regexp engine they use, which may be less problematic if that engine
+ has been built with security considerations in mind (e.g., [RE2]).
+ In any case, a checking implementation is still RECOMMENDED.
+
+ Implementations that specifically implement the I-Regexp subset can,
+ with care, be designed to generally run in linear time and space in
+ the input and to detect when that would not be the case (see below).
+
+ Existing regexp engines should be able to easily handle most
+ I-Regexps (after the adjustments discussed in Section 5), but may
+ consume excessive resources for some types of I-Regexps or outright
+ reject them because they cannot guarantee efficient execution. (Note
+ that different versions of the same regexp library may be more or
+ less vulnerable to excessive resource consumption for these cases.)
+
+ Specifically, range quantifiers (as in a{2,4}) provide particular
+ challenges for both existing and I-Regexp focused implementations.
+ Implementations may therefore limit range quantifiers in
+ composability (disallowing nested range quantifiers such as
+ (a{2,4}){2,4}) or range (disallowing very large ranges such as
+ a{20,200000}), or detect and reject any excessive resource
+ consumption caused by range quantifiers.
+
+ I-Regexp implementations that are used to evaluate regexps from
+ untrusted sources need to be robust in these cases. Implementers
+ using existing regexp libraries are encouraged:
+
+ * to check their documentation to see if mitigations are
+ configurable, such as limits in resource consumption, and
+
+ * to document their own degree of robustness resulting from
+ employing such mitigations.
+
+9. References
+
+9.1. Normative References
+
+ [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
+ Requirement Levels", BCP 14, RFC 2119,
+ DOI 10.17487/RFC2119, March 1997,
+ <https://www.rfc-editor.org/info/rfc2119>.
+
+ [RFC5234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
+ Specifications: ABNF", STD 68, RFC 5234,
+ DOI 10.17487/RFC5234, January 2008,
+ <https://www.rfc-editor.org/info/rfc5234>.
+
+ [RFC7405] Kyzivat, P., "Case-Sensitive String Support in ABNF",
+ RFC 7405, DOI 10.17487/RFC7405, December 2014,
+ <https://www.rfc-editor.org/info/rfc7405>.
+
+ [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
+ 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
+ May 2017, <https://www.rfc-editor.org/info/rfc8174>.
+
+ [XSD-1.1-2]
+ Peterson, D., Ed., Gao, S., Ed., Malhotra, A., Ed.,
+ Sperberg-McQueen, C. M., Ed., Thompson, H., Ed., and P.
+ Biron, Ed., "W3C XML Schema Definition Language (XSD) 1.1
+ Part 2: Datatypes", W3C REC REC-xmlschema11-2-20120405,
+ W3C REC-xmlschema11-2-20120405, 5 April 2012,
+ <https://www.w3.org/TR/2012/REC-xmlschema11-2-20120405/>.
+
+ [XSD-2] Biron, P., Ed. and A. Malhotra, Ed., "XML Schema Part 2:
+ Datatypes Second Edition", W3C REC REC-xmlschema-
+ 2-20041028, W3C REC-xmlschema-2-20041028, 28 October 2004,
+ <https://www.w3.org/TR/2004/REC-xmlschema-2-20041028/>.
+
+9.2. Informative References
+
+ [ECMA-262] Ecma International, "ECMAScript 2020 Language
+ Specification", Standard ECMA-262, 11th Edition, June
+ 2020, <https://www.ecma-international.org/wp-
+ content/uploads/ECMA-262.pdf>.
+
+ [JSONPATH-BASE]
+ Gössner, S., Ed., Normington, G., Ed., and C. Bormann,
+ Ed., "JSONPath: Query expressions for JSON", Work in
+ Progress, Internet-Draft, draft-ietf-jsonpath-base-20, 25
+ August 2023, <https://datatracker.ietf.org/doc/html/draft-
+ ietf-jsonpath-base-20>.
+
+ [PCRE2] "Perl-compatible Regular Expressions (revised API:
+ PCRE2)", <http://pcre.org/current/doc/html/>.
+
+ [RE2] "RE2 is a fast, safe, thread-friendly alternative to
+ backtracking regular expression engines like those used in
+ PCRE, Perl, and Python. It is a C++ library.", commit
+ 73031bb, <https://github.com/google/re2>.
+
+ [RFC7493] Bray, T., Ed., "The I-JSON Message Format", RFC 7493,
+ DOI 10.17487/RFC7493, March 2015,
+ <https://www.rfc-editor.org/info/rfc7493>.
+
+ [STD63] Yergeau, F., "UTF-8, a transformation format of ISO
+ 10646", STD 63, RFC 3629, November 2003.
+
+ <https://www.rfc-editor.org/info/std63>
+
+ [UNICODE-GLOSSARY]
+ Unicode, Inc., "Glossary of Unicode Terms",
+ <https://unicode.org/glossary/>.
+
+Acknowledgements
+
+ Discussion in the IETF JSONPATH WG about whether to include a regexp
+ mechanism into the JSONPath query expression specification and
+ previous discussions about the YANG pattern and Concise Data
+ Definition Language (CDDL) .regexp features motivated this
+ specification.
+
+ The basic approach for this specification was inspired by "The I-JSON
+ Message Format" [RFC7493].
+
+Authors' Addresses
+
+ Carsten Bormann
+ Universität Bremen TZI
+ Postfach 330440
+ D-28359 Bremen
+ Germany
+ Phone: +49-421-218-63921
+ Email: cabo@tzi.org
+
+
+ Tim Bray
+ Textuality
+ Canada
+ Email: tbray@textuality.com