doc: Add RFC documents

author: Thomas Voss <mail@thomasvoss.com> 2024-11-27 20:54:24 +0100
committer: Thomas Voss <mail@thomasvoss.com> 2024-11-27 20:54:24 +0100
commit: 4bfd864f10b68b71482b35c818559068ef8d5797 (patch)
tree: e3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc9485.txt
parent: ea76e11061bda059ae9f9ad130a9895cc85607db (diff)
1 files changed, 486 insertions, 0 deletions
diff --git a/doc/rfc/rfc9485.txt b/doc/rfc/rfc9485.txt
new file mode 100644
index 0000000..157ff1e
--- /dev/null
+++ b/doc/rfc/rfc9485.txt
@@ -0,0 +1,486 @@
+
+
+
+
+Internet Engineering Task Force (IETF)                        C. Bormann
+Request for Comments: 9485                        Universität Bremen TZI
+Category: Standards Track                                        T. Bray
+ISSN: 2070-1721                                               Textuality
+                                                            October 2023
+
+
+          I-Regexp: An Interoperable Regular Expression Format
+
+Abstract
+
+   This document specifies I-Regexp, a flavor of regular expression that
+   is limited in scope with the goal of interoperation across many
+   different regular expression libraries.
+
+Status of This Memo
+
+   This is an Internet Standards Track document.
+
+   This document is a product of the Internet Engineering Task Force
+   (IETF).  It represents the consensus of the IETF community.  It has
+   received public review and has been approved for publication by the
+   Internet Engineering Steering Group (IESG).  Further information on
+   Internet Standards is available in Section 2 of RFC 7841.
+
+   Information about the current status of this document, any errata,
+   and how to provide feedback on it may be obtained at
+   https://www.rfc-editor.org/info/rfc9485.
+
+Copyright Notice
+
+   Copyright (c) 2023 IETF Trust and the persons identified as the
+   document authors.  All rights reserved.
+
+   This document is subject to BCP 78 and the IETF Trust's Legal
+   Provisions Relating to IETF Documents
+   (https://trustee.ietf.org/license-info) in effect on the date of
+   publication of this document.  Please review these documents
+   carefully, as they describe your rights and restrictions with respect
+   to this document.  Code Components extracted from this document must
+   include Revised BSD License text as described in Section 4.e of the
+   Trust Legal Provisions and are provided without warranty as described
+   in the Revised BSD License.
+
+Table of Contents
+
+   1.  Introduction
+     1.1.  Terminology
+   2.  Objectives
+   3.  I-Regexp Syntax
+     3.1.  Checking Implementations
+   4.  I-Regexp Semantics
+   5.  Mapping I-Regexp to Regexp Dialects
+     5.1.  Multi-Character Escapes
+     5.2.  XSD Regexps
+     5.3.  ECMAScript Regexps
+     5.4.  PCRE, RE2, and Ruby Regexps
+   6.  Motivation and Background
+     6.1.  Implementing I-Regexp
+   7.  IANA Considerations
+   8.  Security Considerations
+   9.  References
+     9.1.  Normative References
+     9.2.  Informative References
+   Acknowledgements
+   Authors' Addresses
+
+1.  Introduction
+
+   This specification describes an interoperable regular expression
+   (abbreviated as "regexp") flavor, I-Regexp.
+
+   I-Regexp does not provide advanced regular expression features such
+   as capture groups, lookahead, or backreferences.  It supports only a
+   Boolean matching capability, i.e., testing whether a given regular
+   expression matches a given piece of text.
+
+   I-Regexp supports the entire repertoire of Unicode characters
+   (Unicode scalar values); both the I-Regexp strings themselves and the
+   strings they are matched against are sequences of Unicode scalar
+   values (often represented in UTF-8 encoding form [STD63] for
+   interchange).
+
+   I-Regexp is a subset of XML Schema Definition (XSD) regular
+   expressions [XSD-2].
+
+   This document includes guidance for converting I-Regexps for use with
+   several well-known regular expression idioms.
+
+   The development of I-Regexp was motivated by the work of the JSONPath
+   Working Group (WG).  The WG wanted to include support for the use of
+   regular expressions in JSONPath filters in its specification
+   [JSONPATH-BASE], but was unable to find a useful specification for
+   regular expressions that would be interoperable across the popular
+   libraries.
+
+1.1.  Terminology
+
+   This document uses the abbreviation "regexp" for what is usually
+   called a "regular expression" in programming.  The term "I-Regexp" is
+   used as a noun meaning a character string (sequence of Unicode scalar
+   values) that conforms to the requirements in this specification; the
+   plural is "I-Regexps".
+
+   This specification uses Unicode terminology; a good entry point is
+   provided by [UNICODE-GLOSSARY].
+
+   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
+   "OPTIONAL" in this document are to be interpreted as described in
+   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
+   capitals, as shown here.
+
+   The grammatical rules in this document are to be interpreted as ABNF,
+   as described in [RFC5234] and [RFC7405], where the "characters" of
+   Section 2.3 of [RFC5234] are Unicode scalar values.
+
+2.  Objectives
+
+   I-Regexps should handle the vast majority of practical cases where a
+   matching regexp is needed in a data-model specification or a query-
+   language expression.
+
+   At the time of writing, an editor of this document conducted a survey
+   of the regexp syntax used in recently published RFCs.  All examples
+   found there should be covered by I-Regexps, both syntactically and
+   with their intended semantics.  The exception is the use of multi-
+   character escapes, for which workaround guidance is provided in
+   Section 5.
+
+3.  I-Regexp Syntax
+
+   An I-Regexp MUST conform to the ABNF specification in Figure 1.
+
+   i-regexp = branch *( "|" branch )
+   branch = *piece
+   piece = atom [ quantifier ]
+   quantifier = ( "*" / "+" / "?" ) / range-quantifier
+   range-quantifier = "{" QuantExact [ "," [ QuantExact ] ] "}"
+   QuantExact = 1*%x30-39 ; '0'-'9'
+
+   atom = NormalChar / charClass / ( "(" i-regexp ")" )
+   NormalChar = ( %x00-27 / "," / "-" / %x2F-3E ; '/'-'>'
+    / %x40-5A ; '@'-'Z'
+    / %x5E-7A ; '^'-'z'
+    / %x7E-D7FF ; skip surrogate code points
+    / %xE000-10FFFF )
+   charClass = "." / SingleCharEsc / charClassEsc / charClassExpr
+   SingleCharEsc = "\" ( %x28-2B ; '('-'+'
+    / "-" / "." / "?" / %x5B-5E ; '['-'^'
+    / %s"n" / %s"r" / %s"t" / %x7B-7D ; '{'-'}'
+    )
+   charClassEsc = catEsc / complEsc
+   charClassExpr = "[" [ "^" ] ( "-" / CCE1 ) *CCE1 [ "-" ] "]"
+   CCE1 = ( CCchar [ "-" CCchar ] ) / charClassEsc
+   CCchar = ( %x00-2C / %x2E-5A ; '.'-'Z'
+    / %x5E-D7FF ; skip surrogate code points
+    / %xE000-10FFFF ) / SingleCharEsc
+   catEsc = %s"\p{" charProp "}"
+   complEsc = %s"\P{" charProp "}"
+   charProp = IsCategory
+   IsCategory = Letters / Marks / Numbers / Punctuation / Separators /
+       Symbols / Others
+   Letters = %s"L" [ ( %s"l" / %s"m" / %s"o" / %s"t" / %s"u" ) ]
+   Marks = %s"M" [ ( %s"c" / %s"e" / %s"n" ) ]
+   Numbers = %s"N" [ ( %s"d" / %s"l" / %s"o" ) ]
+   Punctuation = %s"P" [ ( %x63-66 ; 'c'-'f'
+    / %s"i" / %s"o" / %s"s" ) ]
+   Separators = %s"Z" [ ( %s"l" / %s"p" / %s"s" ) ]
+   Symbols = %s"S" [ ( %s"c" / %s"k" / %s"m" / %s"o" ) ]
+   Others = %s"C" [ ( %s"c" / %s"f" / %s"n" / %s"o" ) ]
+
+                     Figure 1: I-Regexp Syntax in ABNF
+
+   As an additional restriction, charClassExpr is not allowed to match
+   [^], which, according to this grammar, would parse as a positive
+   character class containing the single character ^.
+
+   This is essentially an XSD regexp without:
+
+   *  character class subtraction,
+
+   *  multi-character escapes such as \s, \S, and \w, and
+
+   *  Unicode blocks.
+
+   An I-Regexp implementation MUST be a complete implementation of this
+   limited subset.  In particular, full support for the Unicode
+   functionality defined in this specification is REQUIRED.  The
+   implementation:
+
+   *  MUST NOT limit itself to 7- or 8-bit character sets such as ASCII,
+      and
+
+   *  MUST support the Unicode character property set in character
+      classes.
+
+3.1.  Checking Implementations
+
+   A _checking_ I-Regexp implementation is one that checks a supplied
+   regexp for compliance with this specification and reports any
+   problems.  Checking implementations give their users confidence that
+   they didn't accidentally insert syntax that is not interoperable, so
+   checking is RECOMMENDED.  Exceptions to this rule may be made for
+   low-effort implementations that map I-Regexp to another regexp
+   library by simple steps such as performing the mapping operations
+   discussed in Section 5.  Here, the effort needed to do full checking
+   might dwarf the rest of the implementation effort.  Implementations
+   SHOULD document whether or not they are checking.
+
+   Specifications that employ I-Regexp may want to define in which cases
+   their implementations can work with a non-checking I-Regexp
+   implementation and when full checking is needed, possibly in the
+   process of defining their own implementation classes.
+
+4.  I-Regexp Semantics
+
+   This syntax is a subset of that of [XSD-2].  Implementations that
+   interpret I-Regexps MUST yield Boolean results as specified in
+   [XSD-2].  (See also Section 5.2.)
+
+5.  Mapping I-Regexp to Regexp Dialects
+
+   The material in this section is not normative; it is provided as
+   guidance to developers who want to use I-Regexps in the context of
+   other regular expression dialects.
+
+5.1.  Multi-Character Escapes
+
+   I-Regexp does not support common multi-character escapes (MCEs) and
+   character classes built around them.  These can usually be replaced
+   as shown by the examples in Table 1.
+
+                      +============+===============+
+                      | MCE/class: | Replace with: |
+                      +============+===============+
+                      | \S         | [^ \t\n\r]    |
+                      +------------+---------------+
+                      | [\S ]      | [^\t\n\r]     |
+                      +------------+---------------+
+                      | \d         | [0-9]         |
+                      +------------+---------------+
+
+                             Table 1: Example
+                          Substitutes for Multi-
+                            Character Escapes
+
+   Note that the semantics of \d in XSD regular expressions is that of
+   \p{Nd}; however, this would include all Unicode characters that are
+   digits in various writing systems, which is almost certainly not what
+   is required in IETF publications.
+
+   The construct \p{IsBasicLatin} is essentially a reference to legacy
+   ASCII; it can be replaced by the character class [\u0000-\u007f].
+
+5.2.  XSD Regexps
+
+   Any I-Regexp is also an XSD regexp [XSD-2], so the mapping is an
+   identity function.
+
+   Note that a few errata for [XSD-2] have been fixed in [XSD-1.1-2];
+   therefore, it is also included in the Normative References
+   (Section 9.1).  XSD 1.1 is less widely implemented than XSD 1.0, and
+   implementations of XSD 1.0 are likely to include these bugfixes; for
+   the intents and purposes of this specification, an implementation of
+   XSD 1.0 regexps is equivalent to an implementation of XSD 1.1
+   regexps.
+
+5.3.  ECMAScript Regexps
+
+   Perform the following steps on an I-Regexp to obtain an ECMAScript
+   regexp [ECMA-262]:
+
+   *  For any unescaped dots (.) outside character classes (first
+      alternative of charClass production), replace the dot with
+      [^\n\r].
+
+   *  Envelope the result in ^(?: and )$.
+
+   The ECMAScript regexp is to be interpreted as a Unicode pattern ("u"
+   flag; see Section 21.2.2 "Pattern Semantics" of [ECMA-262]).
+
+   Note that where a regexp literal is required, the actual regexp needs
+   to be enclosed in /.
+
+5.4.  PCRE, RE2, and Ruby Regexps
+
+   To obtain a valid regexp in Perl Compatible Regular Expressions
+   (PCRE) [PCRE2], the Go programming language's RE2 regexp library
+   [RE2], and the Ruby programming language, perform the same steps as
+   in Section 5.3, except that the last step is:
+
+   *  Enclose the regexp in \A(?: and )\z.
+
+6.  Motivation and Background
+
+   While regular expressions originally were intended to describe a
+   formal language to support a Boolean matching function, they have
+   been enhanced with parsing functions that support the extraction and
+   replacement of arbitrary portions of the matched text.  With this
+   accretion of features, parsing-regexp libraries have become more
+   susceptible to bugs and surprising performance degradations that can
+   be exploited in denial-of-service attacks by an attacker who controls
+   the regexp submitted for processing.  I-Regexp is designed to offer
+   interoperability and to be less vulnerable to such attacks, with the
+   trade-off that its only function is to offer a Boolean response as to
+   whether a character sequence is matched by a regexp.
+
+6.1.  Implementing I-Regexp
+
+   XSD regexps are relatively easy to implement or map to widely
+   implemented parsing-regexp dialects, with these notable exceptions:
+
+   *  Character class subtraction.  This is a very useful feature in
+      many specifications, but it is unfortunately mostly absent from
+      parsing-regexp dialects.  Thus, it is omitted from I-Regexp.
+
+   *  Multi-character escapes.  \d, \w, \s and their uppercase
+      complement classes exhibit a large amount of variation between
+      regexp flavors.  Thus, they are omitted from I-Regexp.
+
+   *  Not all regexp implementations support access to Unicode tables
+      that enable executing constructs such as \p{Nd}, although the
+      \p/\P feature in general is now quite widely available.  While, in
+      principle, it is possible to translate these into character-class
+      matches, this also requires access to those tables.  Thus, regexp
+      libraries in severely constrained environments may not be able to
+      support I-Regexp conformance.
+
+7.  IANA Considerations
+
+   This document has no IANA actions.
+
+8.  Security Considerations
+
+   While technically out of the scope of this specification, Section 10
+   ("Security Considerations") of RFC 3629 [STD63] applies to
+   implementations.  Particular note needs to be taken of the last
+   paragraph of Section 3 ("UTF-8 definition") of RFC 3629 [STD63]; an
+   I-Regexp implementation may need to mitigate limitations of the
+   platform implementation in this regard.
+
+   As discussed in Section 6, more complex regexp libraries may contain
+   exploitable bugs, which can lead to crashes and remote code
+   execution.  There is also the problem that such libraries often have
+   performance characteristics that are hard to predict, leading to
+   attacks that overload an implementation by matching against an
+   expensive attacker-controlled regexp.
+
+   I-Regexps have been designed to allow implementation in a way that is
+   resilient to both threats; this objective needs to be addressed
+   throughout the implementation effort.  Non-checking implementations
+   (see Section 3.1) are likely to expose security limitations of any
+   regexp engine they use, which may be less problematic if that engine
+   has been built with security considerations in mind (e.g., [RE2]).
+   In any case, a checking implementation is still RECOMMENDED.
+
+   Implementations that specifically implement the I-Regexp subset can,
+   with care, be designed to generally run in linear time and space in
+   the input and to detect when that would not be the case (see below).
+
+   Existing regexp engines should be able to easily handle most
+   I-Regexps (after the adjustments discussed in Section 5), but may
+   consume excessive resources for some types of I-Regexps or outright
+   reject them because they cannot guarantee efficient execution.  (Note
+   that different versions of the same regexp library may be more or
+   less vulnerable to excessive resource consumption for these cases.)
+
+   Specifically, range quantifiers (as in a{2,4}) provide particular
+   challenges for both existing and I-Regexp focused implementations.
+   Implementations may therefore limit range quantifiers in
+   composability (disallowing nested range quantifiers such as
+   (a{2,4}){2,4}) or range (disallowing very large ranges such as
+   a{20,200000}), or detect and reject any excessive resource
+   consumption caused by range quantifiers.
+
+   I-Regexp implementations that are used to evaluate regexps from
+   untrusted sources need to be robust in these cases.  Implementers
+   using existing regexp libraries are encouraged:
+
+   *  to check their documentation to see if mitigations are
+      configurable, such as limits in resource consumption, and
+
+   *  to document their own degree of robustness resulting from
+      employing such mitigations.
+
+9.  References
+
+9.1.  Normative References
+
+   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
+              Requirement Levels", BCP 14, RFC 2119,
+              DOI 10.17487/RFC2119, March 1997,
+              <https://www.rfc-editor.org/info/rfc2119>.
+
+   [RFC5234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
+              Specifications: ABNF", STD 68, RFC 5234,
+              DOI 10.17487/RFC5234, January 2008,
+              <https://www.rfc-editor.org/info/rfc5234>.
+
+   [RFC7405]  Kyzivat, P., "Case-Sensitive String Support in ABNF",
+              RFC 7405, DOI 10.17487/RFC7405, December 2014,
+              <https://www.rfc-editor.org/info/rfc7405>.
+
+   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
+              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
+              May 2017, <https://www.rfc-editor.org/info/rfc8174>.
+
+   [XSD-1.1-2]
+              Peterson, D., Ed., Gao, S., Ed., Malhotra, A., Ed.,
+              Sperberg-McQueen, C. M., Ed., Thompson, H., Ed., and P.
+              Biron, Ed., "W3C XML Schema Definition Language (XSD) 1.1
+              Part 2: Datatypes", W3C REC REC-xmlschema11-2-20120405,
+              W3C REC-xmlschema11-2-20120405, 5 April 2012,
+              <https://www.w3.org/TR/2012/REC-xmlschema11-2-20120405/>.
+
+   [XSD-2]    Biron, P., Ed. and A. Malhotra, Ed., "XML Schema Part 2:
+              Datatypes Second Edition", W3C REC REC-xmlschema-
+              2-20041028, W3C REC-xmlschema-2-20041028, 28 October 2004,
+              <https://www.w3.org/TR/2004/REC-xmlschema-2-20041028/>.
+
+9.2.  Informative References
+
+   [ECMA-262] Ecma International, "ECMAScript 2020 Language
+              Specification", Standard ECMA-262, 11th Edition, June
+              2020, <https://www.ecma-international.org/wp-
+              content/uploads/ECMA-262.pdf>.
+
+   [JSONPATH-BASE]
+              Gössner, S., Ed., Normington, G., Ed., and C. Bormann,
+              Ed., "JSONPath: Query expressions for JSON", Work in
+              Progress, Internet-Draft, draft-ietf-jsonpath-base-20, 25
+              August 2023, <https://datatracker.ietf.org/doc/html/draft-
+              ietf-jsonpath-base-20>.
+
+   [PCRE2]    "Perl-compatible Regular Expressions (revised API:
+              PCRE2)", <http://pcre.org/current/doc/html/>.
+
+   [RE2]      "RE2 is a fast, safe, thread-friendly alternative to
+              backtracking regular expression engines like those used in
+              PCRE, Perl, and Python. It is a C++ library.", commit
+              73031bb, <https://github.com/google/re2>.
+
+   [RFC7493]  Bray, T., Ed., "The I-JSON Message Format", RFC 7493,
+              DOI 10.17487/RFC7493, March 2015,
+              <https://www.rfc-editor.org/info/rfc7493>.
+
+   [STD63]    Yergeau, F., "UTF-8, a transformation format of ISO
+              10646", STD 63, RFC 3629, November 2003.
+
+              <https://www.rfc-editor.org/info/std63>
+
+   [UNICODE-GLOSSARY]
+              Unicode, Inc., "Glossary of Unicode Terms",
+              <https://unicode.org/glossary/>.
+
+Acknowledgements
+
+   Discussion in the IETF JSONPATH WG about whether to include a regexp
+   mechanism into the JSONPath query expression specification and
+   previous discussions about the YANG pattern and Concise Data
+   Definition Language (CDDL) .regexp features motivated this
+   specification.
+
+   The basic approach for this specification was inspired by "The I-JSON
+   Message Format" [RFC7493].
+
+Authors' Addresses
+
+   Carsten Bormann
+   Universität Bremen TZI
+   Postfach 330440
+   D-28359 Bremen
+   Germany
+   Phone: +49-421-218-63921
+   Email: cabo@tzi.org
+
+
+   Tim Bray
+   Textuality
+   Canada
+   Email: tbray@textuality.com
author	Thomas Voss <mail@thomasvoss.com>	2024-11-27 20:54:24 +0100
committer	Thomas Voss <mail@thomasvoss.com>	2024-11-27 20:54:24 +0100
commit	4bfd864f10b68b71482b35c818559068ef8d5797 (patch)
tree	e3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc9485.txt
parent	ea76e11061bda059ae9f9ad130a9895cc85607db (diff)