From 4bfd864f10b68b71482b35c818559068ef8d5797 Mon Sep 17 00:00:00 2001 From: Thomas Voss Date: Wed, 27 Nov 2024 20:54:24 +0100 Subject: doc: Add RFC documents --- doc/rfc/rfc8949.txt | 3674 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 3674 insertions(+) create mode 100644 doc/rfc/rfc8949.txt (limited to 'doc/rfc/rfc8949.txt') diff --git a/doc/rfc/rfc8949.txt b/doc/rfc/rfc8949.txt new file mode 100644 index 0000000..143d1e3 --- /dev/null +++ b/doc/rfc/rfc8949.txt @@ -0,0 +1,3674 @@ + + + + +Internet Engineering Task Force (IETF) C. Bormann +Request for Comments: 8949 Universität Bremen TZI +STD: 94 P. Hoffman +Obsoletes: 7049 ICANN +Category: Standards Track December 2020 +ISSN: 2070-1721 + + + Concise Binary Object Representation (CBOR) + +Abstract + + The Concise Binary Object Representation (CBOR) is a data format + whose design goals include the possibility of extremely small code + size, fairly small message size, and extensibility without the need + for version negotiation. These design goals make it different from + earlier binary serializations such as ASN.1 and MessagePack. + + This document obsoletes RFC 7049, providing editorial improvements, + new details, and errata fixes while keeping full compatibility with + the interchange format of RFC 7049. It does not create a new version + of the format. + +Status of This Memo + + This is an Internet Standards Track document. + + This document is a product of the Internet Engineering Task Force + (IETF). It represents the consensus of the IETF community. It has + received public review and has been approved for publication by the + Internet Engineering Steering Group (IESG). Further information on + Internet Standards is available in Section 2 of RFC 7841. + + Information about the current status of this document, any errata, + and how to provide feedback on it may be obtained at + https://www.rfc-editor.org/info/rfc8949. + +Copyright Notice + + Copyright (c) 2020 IETF Trust and the persons identified as the + document authors. All rights reserved. + + This document is subject to BCP 78 and the IETF Trust's Legal + Provisions Relating to IETF Documents + (https://trustee.ietf.org/license-info) in effect on the date of + publication of this document. Please review these documents + carefully, as they describe your rights and restrictions with respect + to this document. Code Components extracted from this document must + include Simplified BSD License text as described in Section 4.e of + the Trust Legal Provisions and are provided without warranty as + described in the Simplified BSD License. + +Table of Contents + + 1. Introduction + 1.1. Objectives + 1.2. Terminology + 2. CBOR Data Models + 2.1. Extended Generic Data Models + 2.2. Specific Data Models + 3. Specification of the CBOR Encoding + 3.1. Major Types + 3.2. Indefinite Lengths for Some Major Types + 3.2.1. The "break" Stop Code + 3.2.2. Indefinite-Length Arrays and Maps + 3.2.3. Indefinite-Length Byte Strings and Text Strings + 3.2.4. Summary of Indefinite-Length Use of Major Types + 3.3. Floating-Point Numbers and Values with No Content + 3.4. Tagging of Items + 3.4.1. Standard Date/Time String + 3.4.2. Epoch-Based Date/Time + 3.4.3. Bignums + 3.4.4. Decimal Fractions and Bigfloats + 3.4.5. Content Hints + 3.4.5.1. Encoded CBOR Data Item + 3.4.5.2. Expected Later Encoding for CBOR-to-JSON Converters + 3.4.5.3. Encoded Text + 3.4.6. Self-Described CBOR + 4. Serialization Considerations + 4.1. Preferred Serialization + 4.2. Deterministically Encoded CBOR + 4.2.1. Core Deterministic Encoding Requirements + 4.2.2. Additional Deterministic Encoding Considerations + 4.2.3. Length-First Map Key Ordering + 5. Creating CBOR-Based Protocols + 5.1. CBOR in Streaming Applications + 5.2. Generic Encoders and Decoders + 5.3. Validity of Items + 5.3.1. Basic validity + 5.3.2. Tag validity + 5.4. Validity and Evolution + 5.5. Numbers + 5.6. Specifying Keys for Maps + 5.6.1. Equivalence of Keys + 5.7. Undefined Values + 6. Converting Data between CBOR and JSON + 6.1. Converting from CBOR to JSON + 6.2. Converting from JSON to CBOR + 7. Future Evolution of CBOR + 7.1. Extension Points + 7.2. Curating the Additional Information Space + 8. Diagnostic Notation + 8.1. Encoding Indicators + 9. IANA Considerations + 9.1. CBOR Simple Values Registry + 9.2. CBOR Tags Registry + 9.3. Media Types Registry + 9.4. CoAP Content-Format Registry + 9.5. Structured Syntax Suffix Registry + 10. Security Considerations + 11. References + 11.1. Normative References + 11.2. Informative References + Appendix A. Examples of Encoded CBOR Data Items + Appendix B. Jump Table for Initial Byte + Appendix C. Pseudocode + Appendix D. Half-Precision + Appendix E. Comparison of Other Binary Formats to CBOR's Design + Objectives + E.1. ASN.1 DER, BER, and PER + E.2. MessagePack + E.3. BSON + E.4. MSDTP: RFC 713 + E.5. Conciseness on the Wire + Appendix F. Well-Formedness Errors and Examples + F.1. Examples of CBOR Data Items That Are Not Well-Formed + Appendix G. Changes from RFC 7049 + G.1. Errata Processing and Clerical Changes + G.2. Changes in IANA Considerations + G.3. Changes in Suggestions and Other Informational Components + Acknowledgements + Authors' Addresses + +1. Introduction + + There are hundreds of standardized formats for binary representation + of structured data (also known as binary serialization formats). Of + those, some are for specific domains of information, while others are + generalized for arbitrary data. In the IETF, probably the best-known + formats in the latter category are ASN.1's BER and DER [ASN.1]. + + The format defined here follows some specific design goals that are + not well met by current formats. The underlying data model is an + extended version of the JSON data model [RFC8259]. It is important + to note that this is not a proposal that the grammar in RFC 8259 be + extended in general, since doing so would cause a significant + backwards incompatibility with already deployed JSON documents. + Instead, this document simply defines its own data model that starts + from JSON. + + Appendix E lists some existing binary formats and discusses how well + they do or do not fit the design objectives of the Concise Binary + Object Representation (CBOR). + + This document obsoletes [RFC7049], providing editorial improvements, + new details, and errata fixes while keeping full compatibility with + the interchange format of RFC 7049. It does not create a new version + of the format. + +1.1. Objectives + + The objectives of CBOR, roughly in decreasing order of importance, + are: + + 1. The representation must be able to unambiguously encode most + common data formats used in Internet standards. + + * It must represent a reasonable set of basic data types and + structures using binary encoding. "Reasonable" here is + largely influenced by the capabilities of JSON, with the major + addition of binary byte strings. The structures supported are + limited to arrays and trees; loops and lattice-style graphs + are not supported. + + * There is no requirement that all data formats be uniquely + encoded; that is, it is acceptable that the number "7" might + be encoded in multiple different ways. + + 2. The code for an encoder or decoder must be able to be compact in + order to support systems with very limited memory, processor + power, and instruction sets. + + * An encoder and a decoder need to be implementable in a very + small amount of code (for example, in class 1 constrained + nodes as defined in [RFC7228]). + + * The format should use contemporary machine representations of + data (for example, not requiring binary-to-decimal + conversion). + + 3. Data must be able to be decoded without a schema description. + + * Similar to JSON, encoded data should be self-describing so + that a generic decoder can be written. + + 4. The serialization must be reasonably compact, but data + compactness is secondary to code compactness for the encoder and + decoder. + + * "Reasonable" here is bounded by JSON as an upper bound in size + and by the implementation complexity, which limits the amount + of effort that can go into achieving that compactness. Using + either general compression schemes or extensive bit-fiddling + violates the complexity goals. + + 5. The format must be applicable to both constrained nodes and high- + volume applications. + + * This means it must be reasonably frugal in CPU usage for both + encoding and decoding. This is relevant both for constrained + nodes and for potential usage in applications with a very high + volume of data. + + 6. The format must support all JSON data types for conversion to and + from JSON. + + * It must support a reasonable level of conversion as long as + the data represented is within the capabilities of JSON. It + must be possible to define a unidirectional mapping towards + JSON for all types of data. + + 7. The format must be extensible, and the extended data must be + decodable by earlier decoders. + + * The format is designed for decades of use. + + * The format must support a form of extensibility that allows + fallback so that a decoder that does not understand an + extension can still decode the message. + + * The format must be able to be extended in the future by later + IETF standards. + +1.2. Terminology + + The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", + "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and + "OPTIONAL" in this document are to be interpreted as described in + BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all + capitals, as shown here. + + The term "byte" is used in its now-customary sense as a synonym for + "octet". All multi-byte values are encoded in network byte order + (that is, most significant byte first, also known as "big-endian"). + + This specification makes use of the following terminology: + + Data item: A single piece of CBOR data. The structure of a data + item may contain zero, one, or more nested data items. The term + is used both for the data item in representation format and for + the abstract idea that can be derived from that by a decoder; the + former can be addressed specifically by using the term "encoded + data item". + + Decoder: A process that decodes a well-formed encoded CBOR data item + and makes it available to an application. Formally speaking, a + decoder contains a parser to break up the input using the syntax + rules of CBOR, as well as a semantic processor to prepare the data + in a form suitable to the application. + + Encoder: A process that generates the (well-formed) representation + format of a CBOR data item from application information. + + Data Stream: A sequence of zero or more data items, not further + assembled into a larger containing data item (see [RFC8742] for + one application). The independent data items that make up a data + stream are sometimes also referred to as "top-level data items". + + Well-formed: A data item that follows the syntactic structure of + CBOR. A well-formed data item uses the initial bytes and the byte + strings and/or data items that are implied by their values as + defined in CBOR and does not include following extraneous data. + CBOR decoders by definition only return contents from well-formed + data items. + + Valid: A data item that is well-formed and also follows the semantic + restrictions that apply to CBOR data items (Section 5.3). + + Expected: Besides its normal English meaning, the term "expected" is + used to describe requirements beyond CBOR validity that an + application has on its input data. Well-formed (processable at + all), valid (checked by a validity-checking generic decoder), and + expected (checked by the application) form a hierarchy of layers + of acceptability. + + Stream decoder: A process that decodes a data stream and makes each + of the data items in the sequence available to an application as + they are received. + + Terms and concepts for floating-point values such as Infinity, NaN + (not a number), negative zero, and subnormal are defined in + [IEEE754]. + + Where bit arithmetic or data types are explained, this document uses + the notation familiar from the programming language C [C], except + that ".." denotes a range that includes both ends given, and + superscript notation denotes exponentiation. For example, 2 to the + power of 64 is notated: 2^(64). In the plain-text version of this + specification, superscript notation is not available and therefore is + rendered by a surrogate notation. That notation is not optimized for + this RFC; it is unfortunately ambiguous with C's exclusive-or (which + is only used in the appendices, which in turn do not use + exponentiation) and requires circumspection from the reader of the + plain-text version. + + Examples and pseudocode assume that signed integers use two's + complement representation and that right shifts of signed integers + perform sign extension; these assumptions are also specified in + Sections 6.8.1 (basic.fundamental) and 7.6.7 (expr.shift) of the 2020 + version of C++ (currently available as a final draft, [Cplusplus20]). + + Similar to the "0x" notation for hexadecimal numbers, numbers in + binary notation are prefixed with "0b". Underscores can be added to + a number solely for readability, so 0b00100001 (0x21) might be + written 0b001_00001 to emphasize the desired interpretation of the + bits in the byte; in this case, it is split into three bits and five + bits. Encoded CBOR data items are sometimes given in the "0x" or + "0b" notation; these values are first interpreted as numbers as in C + and are then interpreted as byte strings in network byte order, + including any leading zero bytes expressed in the notation. + + Words may be _italicized_ for emphasis; in the plain text form of + this specification, this is indicated by surrounding words with + underscore characters. Verbatim text (e.g., names from a programming + language) may be set in "monospace" type; in plain text, this is + approximated somewhat ambiguously by surrounding the text in double + quotes (which also retain their usual meaning). + +2. CBOR Data Models + + CBOR is explicit about its generic data model, which defines the set + of all data items that can be represented in CBOR. Its basic generic + data model is extensible by the registration of "simple values" and + tags. Applications can then create a subset of the resulting + extended generic data model to build their specific data models. + + Within environments that can represent the data items in the generic + data model, generic CBOR encoders and decoders can be implemented + (which usually involves defining additional implementation data types + for those data items that do not already have a natural + representation in the environment). The ability to provide generic + encoders and decoders is an explicit design goal of CBOR; however, + many applications will provide their own application-specific + encoders and/or decoders. + + In the basic (unextended) generic data model defined in Section 3, a + data item is one of the following: + + * an integer in the range -2^(64)..2^(64)-1 inclusive + + * a simple value, identified by a number between 0 and 255, but + distinct from that number itself + + * a floating-point value, distinct from an integer, out of the set + representable by IEEE 754 binary64 (including non-finites) + [IEEE754] + + * a sequence of zero or more bytes ("byte string") + + * a sequence of zero or more Unicode code points ("text string") + + * a sequence of zero or more data items ("array") + + * a mapping (mathematical function) from zero or more data items + ("keys") each to a data item ("values"), ("map") + + * a tagged data item ("tag"), comprising a tag number (an integer in + the range 0..2^(64)-1) and the tag content (a data item) + + Note that integer and floating-point values are distinct in this + model, even if they have the same numeric value. + + Also note that serialization variants are not visible at the generic + data model level. This deliberate absence of visibility includes the + number of bytes of the encoded floating-point value. It also + includes the choice of encoding for an "argument" (see Section 3) + such as the encoding for an integer, the encoding for the length of a + text or byte string, the encoding for the number of elements in an + array or pairs in a map, or the encoding for a tag number. + +2.1. Extended Generic Data Models + + This basic generic data model has been extended in this document by + the registration of a number of simple values and tag numbers, such + as: + + * "false", "true", "null", and "undefined" (simple values identified + by 20..23, Section 3.3) + + * integer and floating-point values with a larger range and + precision than the above (tag numbers 2 to 5, Section 3.4) + + * application data types such as a point in time or date/time string + defined in RFC 3339 (tag numbers 1 and 0, Section 3.4) + + Additional elements of the extended generic data model can be (and + have been) defined via the IANA registries created for CBOR. Even if + such an extension is unknown to a generic encoder or decoder, data + items using that extension can be passed to or from the application + by representing them at the application interface within the basic + generic data model, i.e., as generic simple values or generic tags. + + In other words, the basic generic data model is stable as defined in + this document, while the extended generic data model expands by the + registration of new simple values or tag numbers, but never shrinks. + + While there is a strong expectation that generic encoders and + decoders can represent "false", "true", and "null" ("undefined" is + intentionally omitted) in the form appropriate for their programming + environment, the implementation of the data model extensions created + by tags is truly optional and a matter of implementation quality. + +2.2. Specific Data Models + + The specific data model for a CBOR-based protocol usually takes a + subset of the extended generic data model and assigns application + semantics to the data items within this subset and its components. + When documenting such specific data models and specifying the types + of data items, it is preferable to identify the types by their + generic data model names ("negative integer", "array") instead of + referring to aspects of their CBOR representation ("major type 1", + "major type 4"). + + Specific data models can also specify value equivalency (including + values of different types) for the purposes of map keys and encoder + freedom. For example, in the generic data model, a valid map MAY + have both "0" and "0.0" as keys, and an encoder MUST NOT encode "0.0" + as an integer (major type 0, Section 3.1). However, if a specific + data model declares that floating-point and integer representations + of integral values are equivalent, using both map keys "0" and "0.0" + in a single map would be considered duplicates, even while encoded as + different major types, and so invalid; and an encoder could encode + integral-valued floats as integers or vice versa, perhaps to save + encoded bytes. + +3. Specification of the CBOR Encoding + + A CBOR data item (Section 2) is encoded to or decoded from a byte + string carrying a well-formed encoded data item as described in this + section. The encoding is summarized in Table 7 in Appendix B, + indexed by the initial byte. An encoder MUST produce only well- + formed encoded data items. A decoder MUST NOT return a decoded data + item when it encounters input that is not a well-formed encoded CBOR + data item (this does not detract from the usefulness of diagnostic + and recovery tools that might make available some information from a + damaged encoded CBOR data item). + + The initial byte of each encoded data item contains both information + about the major type (the high-order 3 bits, described in + Section 3.1) and additional information (the low-order 5 bits). With + a few exceptions, the additional information's value describes how to + load an unsigned integer "argument": + + Less than 24: The argument's value is the value of the additional + information. + + 24, 25, 26, or 27: The argument's value is held in the following 1, + 2, 4, or 8 bytes, respectively, in network byte order. For major + type 7 and additional information value 25, 26, 27, these bytes + are not used as an integer argument, but as a floating-point value + (see Section 3.3). + + 28, 29, 30: These values are reserved for future additions to the + CBOR format. In the present version of CBOR, the encoded item is + not well-formed. + + 31: No argument value is derived. If the major type is 0, 1, or 6, + the encoded item is not well-formed. For major types 2 to 5, the + item's length is indefinite, and for major type 7, the byte does + not constitute a data item at all but terminates an indefinite- + length item; all are described in Section 3.2. + + The initial byte and any additional bytes consumed to construct the + argument are collectively referred to as the _head_ of the data item. + + The meaning of this argument depends on the major type. For example, + in major type 0, the argument is the value of the data item itself + (and in major type 1, the value of the data item is computed from the + argument); in major type 2 and 3, it gives the length of the string + data in bytes that follow; and in major types 4 and 5, it is used to + determine the number of data items enclosed. + + If the encoded sequence of bytes ends before the end of a data item, + that item is not well-formed. If the encoded sequence of bytes still + has bytes remaining after the outermost encoded item is decoded, that + encoding is not a single well-formed CBOR item. Depending on the + application, the decoder may either treat the encoding as not well- + formed or just identify the start of the remaining bytes to the + application. + + A CBOR decoder implementation can be based on a jump table with all + 256 defined values for the initial byte (Table 7). A decoder in a + constrained implementation can instead use the structure of the + initial byte and following bytes for more compact code (see + Appendix C for a rough impression of how this could look). + +3.1. Major Types + + The following lists the major types and the additional information + and other bytes associated with the type. + + Major type 0: + An unsigned integer in the range 0..2^(64)-1 inclusive. The value + of the encoded item is the argument itself. For example, the + integer 10 is denoted as the one byte 0b000_01010 (major type 0, + additional information 10). The integer 500 would be 0b000_11001 + (major type 0, additional information 25) followed by the two + bytes 0x01f4, which is 500 in decimal. + + Major type 1: + A negative integer in the range -2^(64)..-1 inclusive. The value + of the item is -1 minus the argument. For example, the integer + -500 would be 0b001_11001 (major type 1, additional information + 25) followed by the two bytes 0x01f3, which is 499 in decimal. + + Major type 2: + A byte string. The number of bytes in the string is equal to the + argument. For example, a byte string whose length is 5 would have + an initial byte of 0b010_00101 (major type 2, additional + information 5 for the length), followed by 5 bytes of binary + content. A byte string whose length is 500 would have 3 initial + bytes of 0b010_11001 (major type 2, additional information 25 to + indicate a two-byte length) followed by the two bytes 0x01f4 for a + length of 500, followed by 500 bytes of binary content. + + Major type 3: + A text string (Section 2) encoded as UTF-8 [RFC3629]. The number + of bytes in the string is equal to the argument. A string + containing an invalid UTF-8 sequence is well-formed but invalid + (Section 1.2). This type is provided for systems that need to + interpret or display human-readable text, and allows the + differentiation between unstructured bytes and text that has a + specified repertoire (that of Unicode) and encoding (UTF-8). In + contrast to formats such as JSON, the Unicode characters in this + type are never escaped. Thus, a newline character (U+000A) is + always represented in a string as the byte 0x0a, and never as the + bytes 0x5c6e (the characters "\" and "n") nor as 0x5c7530303061 + (the characters "\", "u", "0", "0", "0", and "a"). + + Major type 4: + An array of data items. In other formats, arrays are also called + lists, sequences, or tuples (a "CBOR sequence" is something + slightly different, though [RFC8742]). The argument is the number + of data items in the array. Items in an array do not need to all + be of the same type. For example, an array that contains 10 items + of any type would have an initial byte of 0b100_01010 (major type + 4, additional information 10 for the length) followed by the 10 + remaining items. + + Major type 5: + A map of pairs of data items. Maps are also called tables, + dictionaries, hashes, or objects (in JSON). A map is comprised of + pairs of data items, each pair consisting of a key that is + immediately followed by a value. The argument is the number of + _pairs_ of data items in the map. For example, a map that + contains 9 pairs would have an initial byte of 0b101_01001 (major + type 5, additional information 9 for the number of pairs) followed + by the 18 remaining items. The first item is the first key, the + second item is the first value, the third item is the second key, + and so on. Because items in a map come in pairs, their total + number is always even: a map that contains an odd number of items + (no value data present after the last key data item) is not well- + formed. A map that has duplicate keys may be well-formed, but it + is not valid, and thus it causes indeterminate decoding; see also + Section 5.6. + + Major type 6: + A tagged data item ("tag") whose tag number, an integer in the + range 0..2^(64)-1 inclusive, is the argument and whose enclosed + data item (_tag content_) is the single encoded data item that + follows the head. See Section 3.4. + + Major type 7: + Floating-point numbers and simple values, as well as the "break" + stop code. See Section 3.3. + + These eight major types lead to a simple table showing which of the + 256 possible values for the initial byte of a data item are used + (Table 7). + + In major types 6 and 7, many of the possible values are reserved for + future specification. See Section 9 for more information on these + values. + + Table 1 summarizes the major types defined by CBOR, ignoring + Section 3.2 for now. The number N in this table stands for the + argument. + + +============+=======================+=========================+ + | Major Type | Meaning | Content | + +============+=======================+=========================+ + | 0 | unsigned integer N | - | + +------------+-----------------------+-------------------------+ + | 1 | negative integer -1-N | - | + +------------+-----------------------+-------------------------+ + | 2 | byte string | N bytes | + +------------+-----------------------+-------------------------+ + | 3 | text string | N bytes (UTF-8 text) | + +------------+-----------------------+-------------------------+ + | 4 | array | N data items (elements) | + +------------+-----------------------+-------------------------+ + | 5 | map | 2N data items (key/ | + | | | value pairs) | + +------------+-----------------------+-------------------------+ + | 6 | tag of number N | 1 data item | + +------------+-----------------------+-------------------------+ + | 7 | simple/float | - | + +------------+-----------------------+-------------------------+ + + Table 1: Overview over the Definite-Length Use of CBOR Major + Types (N = Argument) + +3.2. Indefinite Lengths for Some Major Types + + Four CBOR items (arrays, maps, byte strings, and text strings) can be + encoded with an indefinite length using additional information value + 31. This is useful if the encoding of the item needs to begin before + the number of items inside the array or map, or the total length of + the string, is known. (The ability to start sending a data item + before all of it is known is often referred to as "streaming" within + that data item.) + + Indefinite-length arrays and maps are dealt with differently than + indefinite-length strings (byte strings and text strings). + +3.2.1. The "break" Stop Code + + The "break" stop code is encoded with major type 7 and additional + information value 31 (0b111_11111). It is not itself a data item: it + is just a syntactic feature to close an indefinite-length item. + + If the "break" stop code appears where a data item is expected, other + than directly inside an indefinite-length string, array, or map -- + for example, directly inside a definite-length array or map -- the + enclosing item is not well-formed. + +3.2.2. Indefinite-Length Arrays and Maps + + Indefinite-length arrays and maps are represented using their major + type with the additional information value of 31, followed by an + arbitrary-length sequence of zero or more items for an array or key/ + value pairs for a map, followed by the "break" stop code + (Section 3.2.1). In other words, indefinite-length arrays and maps + look identical to other arrays and maps except for beginning with the + additional information value of 31 and ending with the "break" stop + code. + + If the "break" stop code appears after a key in a map, in place of + that key's value, the map is not well-formed. + + There is no restriction against nesting indefinite-length array or + map items. A "break" only terminates a single item, so nested + indefinite-length items need exactly as many "break" stop codes as + there are type bytes starting an indefinite-length item. + + For example, assume an encoder wants to represent the abstract array + [1, [2, 3], [4, 5]]. The definite-length encoding would be + 0x8301820203820405: + + 83 -- Array of length 3 + 01 -- 1 + 82 -- Array of length 2 + 02 -- 2 + 03 -- 3 + 82 -- Array of length 2 + 04 -- 4 + 05 -- 5 + + Indefinite-length encoding could be applied independently to each of + the three arrays encoded in this data item, as required, leading to + representations such as: + + 0x9f018202039f0405ffff + 9F -- Start indefinite-length array + 01 -- 1 + 82 -- Array of length 2 + 02 -- 2 + 03 -- 3 + 9F -- Start indefinite-length array + 04 -- 4 + 05 -- 5 + FF -- "break" (inner array) + FF -- "break" (outer array) + + 0x9f01820203820405ff + 9F -- Start indefinite-length array + 01 -- 1 + 82 -- Array of length 2 + 02 -- 2 + 03 -- 3 + 82 -- Array of length 2 + 04 -- 4 + 05 -- 5 + FF -- "break" + + 0x83018202039f0405ff + 83 -- Array of length 3 + 01 -- 1 + 82 -- Array of length 2 + 02 -- 2 + 03 -- 3 + 9F -- Start indefinite-length array + 04 -- 4 + 05 -- 5 + FF -- "break" + + 0x83019f0203ff820405 + 83 -- Array of length 3 + 01 -- 1 + 9F -- Start indefinite-length array + 02 -- 2 + 03 -- 3 + FF -- "break" + 82 -- Array of length 2 + 04 -- 4 + 05 -- 5 + + An example of an indefinite-length map (that happens to have two key/ + value pairs) might be: + + 0xbf6346756ef563416d7421ff + BF -- Start indefinite-length map + 63 -- First key, UTF-8 string length 3 + 46756e -- "Fun" + F5 -- First value, true + 63 -- Second key, UTF-8 string length 3 + 416d74 -- "Amt" + 21 -- Second value, -2 + FF -- "break" + +3.2.3. Indefinite-Length Byte Strings and Text Strings + + Indefinite-length strings are represented by a byte containing the + major type for byte string or text string with an additional + information value of 31, followed by a series of zero or more strings + of the specified type ("chunks") that have definite lengths, and + finished by the "break" stop code (Section 3.2.1). The data item + represented by the indefinite-length string is the concatenation of + the chunks. If no chunks are present, the data item is an empty + string of the specified type. Zero-length chunks, while not + particularly useful, are permitted. + + If any item between the indefinite-length string indicator + (0b010_11111 or 0b011_11111) and the "break" stop code is not a + definite-length string item of the same major type, the string is not + well-formed. + + The design does not allow nesting indefinite-length strings as chunks + into indefinite-length strings. If it were allowed, it would require + decoder implementations to keep a stack, or at least a count, of + nesting levels. It is unnecessary on the encoder side because the + inner indefinite-length string would consist of chunks, and these + could instead be put directly into the outer indefinite-length + string. + + If any definite-length text string inside an indefinite-length text + string is invalid, the indefinite-length text string is invalid. + Note that this implies that the UTF-8 bytes of a single Unicode code + point (scalar value) cannot be spread between chunks: a new chunk of + a text string can only be started at a code point boundary. + + For example, assume an encoded data item consisting of the bytes: + + 0b010_11111 0b010_00100 0xaabbccdd 0b010_00011 0xeeff99 0b111_11111 + 5F -- Start indefinite-length byte string + 44 -- Byte string of length 4 + aabbccdd -- Bytes content + 43 -- Byte string of length 3 + eeff99 -- Bytes content + FF -- "break" + + After decoding, this results in a single byte string with seven + bytes: 0xaabbccddeeff99. + +3.2.4. Summary of Indefinite-Length Use of Major Types + + Table 2 summarizes the major types defined by CBOR as used for + indefinite-length encoding (with additional information set to 31). + + +============+===================+==================================+ + | Major Type | Meaning | Enclosed up to "break" Stop Code | + +============+===================+==================================+ + | 0 | (not well- | - | + | | formed) | | + +------------+-------------------+----------------------------------+ + | 1 | (not well- | - | + | | formed) | | + +------------+-------------------+----------------------------------+ + | 2 | byte string | definite-length byte strings | + +------------+-------------------+----------------------------------+ + | 3 | text string | definite-length text strings | + +------------+-------------------+----------------------------------+ + | 4 | array | data items (elements) | + +------------+-------------------+----------------------------------+ + | 5 | map | data items (key/value pairs) | + +------------+-------------------+----------------------------------+ + | 6 | (not well- | - | + | | formed) | | + +------------+-------------------+----------------------------------+ + | 7 | "break" stop | - | + | | code | | + +------------+-------------------+----------------------------------+ + + Table 2: Overview of the Indefinite-Length Use of CBOR Major + Types (Additional Information = 31) + +3.3. Floating-Point Numbers and Values with No Content + + Major type 7 is for two types of data: floating-point numbers and + "simple values" that do not need any content. Each value of the + 5-bit additional information in the initial byte has its own separate + meaning, as defined in Table 3. Like the major types for integers, + items of this major type do not carry content data; all the + information is in the initial bytes (the head). + + +=============+===================================================+ + | 5-Bit Value | Semantics | + +=============+===================================================+ + | 0..23 | Simple value (value 0..23) | + +-------------+---------------------------------------------------+ + | 24 | Simple value (value 32..255 in following byte) | + +-------------+---------------------------------------------------+ + | 25 | IEEE 754 Half-Precision Float (16 bits follow) | + +-------------+---------------------------------------------------+ + | 26 | IEEE 754 Single-Precision Float (32 bits follow) | + +-------------+---------------------------------------------------+ + | 27 | IEEE 754 Double-Precision Float (64 bits follow) | + +-------------+---------------------------------------------------+ + | 28-30 | Reserved, not well-formed in the present document | + +-------------+---------------------------------------------------+ + | 31 | "break" stop code for indefinite-length items | + | | (Section 3.2.1) | + +-------------+---------------------------------------------------+ + + Table 3: Values for Additional Information in Major Type 7 + + As with all other major types, the 5-bit value 24 signifies a single- + byte extension: it is followed by an additional byte to represent the + simple value. (To minimize confusion, only the values 32 to 255 are + used.) This maintains the structure of the initial bytes: as for the + other major types, the length of these always depends on the + additional information in the first byte. Table 4 lists the numeric + values assigned and available for simple values. + + +=========+==============+ + | Value | Semantics | + +=========+==============+ + | 0..19 | (unassigned) | + +---------+--------------+ + | 20 | false | + +---------+--------------+ + | 21 | true | + +---------+--------------+ + | 22 | null | + +---------+--------------+ + | 23 | undefined | + +---------+--------------+ + | 24..31 | (reserved) | + +---------+--------------+ + | 32..255 | (unassigned) | + +---------+--------------+ + + Table 4: Simple Values + + An encoder MUST NOT issue two-byte sequences that start with 0xf8 + (major type 7, additional information 24) and continue with a byte + less than 0x20 (32 decimal). Such sequences are not well-formed. + (This implies that an encoder cannot encode "false", "true", "null", + or "undefined" in two-byte sequences and that only the one-byte + variants of these are well-formed; more generally speaking, each + simple value only has a single representation variant). + + The 5-bit values of 25, 26, and 27 are for 16-bit, 32-bit, and 64-bit + IEEE 754 binary floating-point values [IEEE754]. These floating- + point values are encoded in the additional bytes of the appropriate + size. (See Appendix D for some information about 16-bit floating- + point numbers.) + +3.4. Tagging of Items + + In CBOR, a data item can be enclosed by a tag to give it some + additional semantics, as uniquely identified by a _tag number_. The + tag is major type 6, its argument (Section 3) indicates the tag + number, and it contains a single enclosed data item, the _tag + content_. (If a tag requires further structure to its content, this + structure is provided by the enclosed data item.) We use the term + _tag_ for the entire data item consisting of both a tag number and + the tag content: the tag content is the data item that is being + tagged. + + For example, assume that a byte string of length 12 is marked with a + tag of number 2 to indicate it is an unsigned _bignum_ + (Section 3.4.3). The encoded data item would start with a byte + 0b110_00010 (major type 6, additional information 2 for the tag + number) followed by the encoded tag content: 0b010_01100 (major type + 2, additional information 12 for the length) followed by the 12 bytes + of the bignum. + + In the extended generic data model, a tag number's definition + describes the additional semantics conveyed with the tag number. + These semantics may include equivalence of some tagged data items + with other data items, including some that can be represented in the + basic generic data model. For instance, 0xc24101, a bignum the tag + content of which is the byte string with the single byte 0x01, is + equivalent to an integer 1, which could also be encoded as 0x01, + 0x1801, or 0x190001. The tag definition may specify a preferred + serialization (Section 4.1) that is recommended for generic encoders; + this may prefer basic generic data model representations over ones + that employ a tag. + + The tag definition usually defines which nested data items are valid + for such tags. Tag definitions may restrict their content to a very + specific syntactic structure, as the tags defined in this document + do, or they may define their content more semantically. An example + for the latter is how tags 40 and 1040 accept multiple ways to + represent arrays [RFC8746]. + + As a matter of convention, many tags do not accept "null" or + "undefined" values as tag content; instead, the expectation is that a + "null" or "undefined" value can be used in place of the entire tag; + Section 3.4.2 provides some further considerations for one specific + tag about the handling of this convention in application protocols + and in mapping to platform types. + + Decoders do not need to understand tags of every tag number, and tags + may be of little value in applications where the implementation + creating a particular CBOR data item and the implementation decoding + that stream know the semantic meaning of each item in the data flow. + The primary purpose of tags in this specification is to define common + data types such as dates. A secondary purpose is to provide + conversion hints when it is foreseen that the CBOR data item needs to + be translated into a different format, requiring hints about the + content of items. Understanding the semantics of tags is optional + for a decoder; it can simply present both the tag number and the tag + content to the application, without interpreting the additional + semantics of the tag. + + A tag applies semantics to the data item it encloses. Tags can nest: + if tag A encloses tag B, which encloses data item C, tag A applies to + the result of applying tag B on data item C. + + IANA maintains a registry of tag numbers as described in Section 9.2. + Table 5 provides a list of tag numbers that were defined in [RFC7049] + with definitions in the rest of this section. (Tag number 35 was + also defined in [RFC7049]; a discussion of this tag number follows in + Section 3.4.5.3.) Note that many other tag numbers have been defined + since the publication of [RFC7049]; see the registry described at + Section 9.2 for the complete list. + + +=======+=============+==================================+ + | Tag | Data Item | Semantics | + +=======+=============+==================================+ + | 0 | text string | Standard date/time string; see | + | | | Section 3.4.1 | + +-------+-------------+----------------------------------+ + | 1 | integer or | Epoch-based date/time; see | + | | float | Section 3.4.2 | + +-------+-------------+----------------------------------+ + | 2 | byte string | Unsigned bignum; see | + | | | Section 3.4.3 | + +-------+-------------+----------------------------------+ + | 3 | byte string | Negative bignum; see | + | | | Section 3.4.3 | + +-------+-------------+----------------------------------+ + | 4 | array | Decimal fraction; see | + | | | Section 3.4.4 | + +-------+-------------+----------------------------------+ + | 5 | array | Bigfloat; see Section 3.4.4 | + +-------+-------------+----------------------------------+ + | 21 | (any) | Expected conversion to base64url | + | | | encoding; see Section 3.4.5.2 | + +-------+-------------+----------------------------------+ + | 22 | (any) | Expected conversion to base64 | + | | | encoding; see Section 3.4.5.2 | + +-------+-------------+----------------------------------+ + | 23 | (any) | Expected conversion to base16 | + | | | encoding; see Section 3.4.5.2 | + +-------+-------------+----------------------------------+ + | 24 | byte string | Encoded CBOR data item; see | + | | | Section 3.4.5.1 | + +-------+-------------+----------------------------------+ + | 32 | text string | URI; see Section 3.4.5.3 | + +-------+-------------+----------------------------------+ + | 33 | text string | base64url; see Section 3.4.5.3 | + +-------+-------------+----------------------------------+ + | 34 | text string | base64; see Section 3.4.5.3 | + +-------+-------------+----------------------------------+ + | 36 | text string | MIME message; see | + | | | Section 3.4.5.3 | + +-------+-------------+----------------------------------+ + | 55799 | (any) | Self-described CBOR; see | + | | | Section 3.4.6 | + +-------+-------------+----------------------------------+ + + Table 5: Tag Numbers Defined in RFC 7049 + + Conceptually, tags are interpreted in the generic data model, not at + (de-)serialization time. A small number of tags (at this time, tag + number 25 and tag number 29 [IANA.cbor-tags]) have been registered + with semantics that may require processing at (de-)serialization + time: the decoder needs to be aware of, and the encoder needs to be + in control of, the exact sequence in which data items are encoded + into the CBOR data item. This means these tags cannot be implemented + on top of an arbitrary generic CBOR encoder/decoder (which might not + reflect the serialization order for entries in a map at the data + model level and vice versa); their implementation therefore typically + needs to be integrated into the generic encoder/decoder. The + definition of new tags with this property is NOT RECOMMENDED. + + IANA allocated tag numbers 65535, 4294967295, and + 18446744073709551615 (binary all-ones in 16-bit, 32-bit, and 64-bit). + These can be used as a convenience for implementers who want a + single-integer data structure to indicate either the presence of a + specific tag or absence of a tag. That allocation is described in + Section 10 of [CBOR-TAGS]. These tags are not intended to occur in + actual CBOR data items; implementations MAY flag such an occurrence + as an error. + + Protocols can extend the generic data model (Section 2) with data + items representing points in time by using tag numbers 0 and 1, with + arbitrarily sized integers by using tag numbers 2 and 3, and with + floating-point values of arbitrary size and precision by using tag + numbers 4 and 5. + +3.4.1. Standard Date/Time String + + Tag number 0 contains a text string in the standard format described + by the "date-time" production in [RFC3339], as refined by Section 3.3 + of [RFC4287], representing the point in time described there. A + nested item of another type or a text string that doesn't match the + format described in [RFC4287] is invalid. + +3.4.2. Epoch-Based Date/Time + + Tag number 1 contains a numerical value counting the number of + seconds from 1970-01-01T00:00Z in UTC time to the represented point + in civil time. + + The tag content MUST be an unsigned or negative integer (major types + 0 and 1) or a floating-point number (major type 7 with additional + information 25, 26, or 27). Other contained types are invalid. + + Nonnegative values (major type 0 and nonnegative floating-point + numbers) stand for time values on or after 1970-01-01T00:00Z UTC and + are interpreted according to POSIX [TIME_T]. (POSIX time is also + known as "UNIX Epoch time".) Leap seconds are handled specially by + POSIX time, and this results in a 1-second discontinuity several + times per decade. Note that applications that require the expression + of times beyond early 2106 cannot leave out support of 64-bit + integers for the tag content. + + Negative values (major type 1 and negative floating-point numbers) + are interpreted as determined by the application requirements as + there is no universal standard for UTC count-of-seconds time before + 1970-01-01T00:00Z (this is particularly true for points in time that + precede discontinuities in national calendars). The same applies to + non-finite values. + + To indicate fractional seconds, floating-point values can be used + within tag number 1 instead of integer values. Note that this + generally requires binary64 support, as binary16 and binary32 provide + nonzero fractions of seconds only for a short period of time around + early 1970. An application that requires tag number 1 support may + restrict the tag content to be an integer (or a floating-point value) + only. + + Note that platform types for date/time may include "null" or + "undefined" values, which may also be desirable at an application + protocol level. While emitting tag number 1 values with non-finite + tag content values (e.g., with NaN for undefined date/time values or + with Infinity for an expiry date that is not set) may seem an obvious + way to handle this, using untagged "null" or "undefined" avoids the + use of non-finites and results in a shorter encoding. Application + protocol designers are encouraged to consider these cases and include + clear guidelines for handling them. + +3.4.3. Bignums + + Protocols using tag numbers 2 and 3 extend the generic data model + (Section 2) with "bignums" representing arbitrarily sized integers. + In the basic generic data model, bignum values are not equal to + integers from the same model, but the extended generic data model + created by this tag definition defines equivalence based on numeric + value, and preferred serialization (Section 4.1) never makes use of + bignums that also can be expressed as basic integers (see below). + + Bignums are encoded as a byte string data item, which is interpreted + as an unsigned integer n in network byte order. Contained items of + other types are invalid. For tag number 2, the value of the bignum + is n. For tag number 3, the value of the bignum is -1 - n. The + preferred serialization of the byte string is to leave out any + leading zeroes (note that this means the preferred serialization for + n = 0 is the empty byte string, but see below). Decoders that + understand these tags MUST be able to decode bignums that do have + leading zeroes. The preferred serialization of an integer that can + be represented using major type 0 or 1 is to encode it this way + instead of as a bignum (which means that the empty string never + occurs in a bignum when using preferred serialization). Note that + this means the non-preferred choice of a bignum representation + instead of a basic integer for encoding a number is not intended to + have application semantics (just as the choice of a longer basic + integer representation than needed, such as 0x1800 for 0x00, does + not). + + For example, the number 18446744073709551616 (2^(64)) is represented + as 0b110_00010 (major type 6, tag number 2), followed by 0b010_01001 + (major type 2, length 9), followed by 0x010000000000000000 (one byte + 0x01 and eight bytes 0x00). In hexadecimal: + + C2 -- Tag 2 + 49 -- Byte string of length 9 + 010000000000000000 -- Bytes content + +3.4.4. Decimal Fractions and Bigfloats + + Protocols using tag number 4 extend the generic data model with data + items representing arbitrary-length decimal fractions of the form + m*(10^(e)). Protocols using tag number 5 extend the generic data + model with data items representing arbitrary-length binary fractions + of the form m*(2^(e)). As with bignums, values of different types + are not equal in the generic data model. + + Decimal fractions combine an integer mantissa with a base-10 scaling + factor. They are most useful if an application needs the exact + representation of a decimal fraction such as 1.1 because there is no + exact representation for many decimal fractions in binary floating- + point representations. + + "Bigfloats" combine an integer mantissa with a base-2 scaling factor. + They are binary floating-point values that can exceed the range or + the precision of the three IEEE 754 formats supported by CBOR + (Section 3.3). Bigfloats may also be used by constrained + applications that need some basic binary floating-point capability + without the need for supporting IEEE 754. + + A decimal fraction or a bigfloat is represented as a tagged array + that contains exactly two integer numbers: an exponent e and a + mantissa m. Decimal fractions (tag number 4) use base-10 exponents; + the value of a decimal fraction data item is m*(10^(e)). Bigfloats + (tag number 5) use base-2 exponents; the value of a bigfloat data + item is m*(2^(e)). The exponent e MUST be represented in an integer + of major type 0 or 1, while the mantissa can also be a bignum + (Section 3.4.3). Contained items with other structures are invalid. + + An example of a decimal fraction is the representation of the number + 273.15 as 0b110_00100 (major type 6 for tag, additional information 4 + for the tag number), followed by 0b100_00010 (major type 4 for the + array, additional information 2 for the length of the array), + followed by 0b001_00001 (major type 1 for the first integer, + additional information 1 for the value of -2), followed by + 0b000_11001 (major type 0 for the second integer, additional + information 25 for a two-byte value), followed by 0b0110101010110011 + (27315 in two bytes). In hexadecimal: + + C4 -- Tag 4 + 82 -- Array of length 2 + 21 -- -2 + 19 6ab3 -- 27315 + + An example of a bigfloat is the representation of the number 1.5 as + 0b110_00101 (major type 6 for tag, additional information 5 for the + tag number), followed by 0b100_00010 (major type 4 for the array, + additional information 2 for the length of the array), followed by + 0b001_00000 (major type 1 for the first integer, additional + information 0 for the value of -1), followed by 0b000_00011 (major + type 0 for the second integer, additional information 3 for the value + of 3). In hexadecimal: + + C5 -- Tag 5 + 82 -- Array of length 2 + 20 -- -1 + 03 -- 3 + + Decimal fractions and bigfloats provide no representation of + Infinity, -Infinity, or NaN; if these are needed in place of a + decimal fraction or bigfloat, the IEEE 754 half-precision + representations from Section 3.3 can be used. + +3.4.5. Content Hints + + The tags in this section are for content hints that might be used by + generic CBOR processors. These content hints do not extend the + generic data model. + +3.4.5.1. Encoded CBOR Data Item + + Sometimes it is beneficial to carry an embedded CBOR data item that + is not meant to be decoded immediately at the time the enclosing data + item is being decoded. Tag number 24 (CBOR data item) can be used to + tag the embedded byte string as a single data item encoded in CBOR + format. Contained items that aren't byte strings are invalid. A + contained byte string is valid if it encodes a well-formed CBOR data + item; validity checking of the decoded CBOR item is not required for + tag validity (but could be offered by a generic decoder as a special + option). + +3.4.5.2. Expected Later Encoding for CBOR-to-JSON Converters + + Tag numbers 21 to 23 indicate that a byte string might require a + specific encoding when interoperating with a text-based + representation. These tags are useful when an encoder knows that the + byte string data it is writing is likely to be later converted to a + particular JSON-based usage. That usage specifies that some strings + are encoded as base64, base64url, and so on. The encoder uses byte + strings instead of doing the encoding itself to reduce the message + size, to reduce the code size of the encoder, or both. The encoder + does not know whether or not the converter will be generic, and + therefore wants to say what it believes is the proper way to convert + binary strings to JSON. + + The data item tagged can be a byte string or any other data item. In + the latter case, the tag applies to all of the byte string data items + contained in the data item, except for those contained in a nested + data item tagged with an expected conversion. + + These three tag numbers suggest conversions to three of the base data + encodings defined in [RFC4648]. Tag number 21 suggests conversion to + base64url encoding (Section 5 of [RFC4648]) where padding is not used + (see Section 3.2 of [RFC4648]); that is, all trailing equals signs + ("=") are removed from the encoded string. Tag number 22 suggests + conversion to classical base64 encoding (Section 4 of [RFC4648]) with + padding as defined in RFC 4648. For both base64url and base64, + padding bits are set to zero (see Section 3.5 of [RFC4648]), and the + conversion to alternate encoding is performed on the contents of the + byte string (that is, without adding any line breaks, whitespace, or + other additional characters). Tag number 23 suggests conversion to + base16 (hex) encoding with uppercase alphabetics (see Section 8 of + [RFC4648]). Note that, for all three tag numbers, the encoding of + the empty byte string is the empty text string. + +3.4.5.3. Encoded Text + + Some text strings hold data that have formats widely used on the + Internet, and sometimes those formats can be validated and presented + to the application in appropriate form by the decoder. There are + tags for some of these formats. + + * Tag number 32 is for URIs, as defined in [RFC3986]. If the text + string doesn't match the "URI-reference" production, the string is + invalid. + + * Tag numbers 33 and 34 are for base64url- and base64-encoded text + strings, respectively, as defined in [RFC4648]. If any of the + following apply: + + - the encoded text string contains non-alphabet characters or + only 1 alphabet character in the last block of 4 (where + alphabet is defined by Section 5 of [RFC4648] for tag number 33 + and Section 4 of [RFC4648] for tag number 34), or + + - the padding bits in a 2- or 3-character block are not 0, or + + - the base64 encoding has the wrong number of padding characters, + or + + - the base64url encoding has padding characters, + + the string is invalid. + + * Tag number 36 is for MIME messages (including all headers), as + defined in [RFC2045]. A text string that isn't a valid MIME + message is invalid. (For this tag, validity checking may be + particularly onerous for a generic decoder and might therefore not + be offered. Note that many MIME messages are general binary data + and therefore cannot be represented in a text string; + [IANA.cbor-tags] lists a registration for tag number 257 that is + similar to tag number 36 but uses a byte string as its tag + content.) + + Note that tag numbers 33 and 34 differ from 21 and 22 in that the + data is transported in base-encoded form for the former and in raw + byte string form for the latter. + + [RFC7049] also defined a tag number 35 for regular expressions that + are in Perl Compatible Regular Expressions (PCRE/PCRE2) form [PCRE] + or in JavaScript regular expression syntax [ECMA262]. The state of + the art in these regular expression specifications has since advanced + and is continually advancing, so this specification does not attempt + to update the references. Instead, this tag remains available (as + registered in [RFC7049]) for applications that specify the particular + regular expression variant they use out-of-band (possibly by limiting + the usage to a defined common subset of both PCRE and ECMA262). As + this specification clarifies tag validity beyond [RFC7049], we note + that due to the open way the tag was defined in [RFC7049], any + contained string value needs to be valid at the CBOR tag level (but + then may not be "expected" at the application level). + +3.4.6. Self-Described CBOR + + In many applications, it will be clear from the context that CBOR is + being employed for encoding a data item. For instance, a specific + protocol might specify the use of CBOR, or a media type is indicated + that specifies its use. However, there may be applications where + such context information is not available, such as when CBOR data is + stored in a file that does not have disambiguating metadata. Here, + it may help to have some distinguishing characteristics for the data + itself. + + Tag number 55799 is defined for this purpose, specifically for use at + the start of a stored encoded CBOR data item as specified by an + application. It does not impart any special semantics on the data + item that it encloses; that is, the semantics of the tag content + enclosed in tag number 55799 is exactly identical to the semantics of + the tag content itself. + + The serialization of this tag's head is 0xd9d9f7, which does not + appear to be in use as a distinguishing mark for any frequently used + file types. In particular, 0xd9d9f7 is not a valid start of a + Unicode text in any Unicode encoding if it is followed by a valid + CBOR data item. + + For instance, a decoder might be able to decode both CBOR and JSON. + Such a decoder would need to mechanically distinguish the two + formats. An easy way for an encoder to help the decoder would be to + tag the entire CBOR item with tag number 55799, the serialization of + which will never be found at the beginning of a JSON text. + +4. Serialization Considerations + +4.1. Preferred Serialization + + For some values at the data model level, CBOR provides multiple + serializations. For many applications, it is desirable that an + encoder always chooses a preferred serialization (preferred + encoding); however, the present specification does not put the burden + of enforcing this preference on either the encoder or decoder. + + Some constrained decoders may be limited in their ability to decode + non-preferred serializations: for example, if only integers below + 1_000_000_000 (one billion) are expected in an application, the + decoder may leave out the code that would be needed to decode 64-bit + arguments in integers. An encoder that always uses preferred + serialization ("preferred encoder") interoperates with this decoder + for the numbers that can occur in this application. Generally + speaking, a preferred encoder is more universally interoperable (and + also less wasteful) than one that, say, always uses 64-bit integers. + + Similarly, a constrained encoder may be limited in the variety of + representation variants it supports such that it does not emit + preferred serializations ("variant encoder"). For instance, a + constrained encoder could be designed to always use the 32-bit + variant for an integer that it encodes even if a short representation + is available (assuming that there is no application need for integers + that can only be represented with the 64-bit variant). A decoder + that does not rely on receiving only preferred serializations + ("variation-tolerant decoder") can therefore be said to be more + universally interoperable (it might very well optimize for the case + of receiving preferred serializations, though). Full implementations + of CBOR decoders are by definition variation tolerant; the + distinction is only relevant if a constrained implementation of a + CBOR decoder meets a variant encoder. + + The preferred serialization always uses the shortest form of + representing the argument (Section 3); it also uses the shortest + floating-point encoding that preserves the value being encoded. + + The preferred serialization for a floating-point value is the + shortest floating-point encoding that preserves its value, e.g., + 0xf94580 for the number 5.5, and 0xfa45ad9c00 for the number 5555.5. + For NaN values, a shorter encoding is preferred if zero-padding the + shorter significand towards the right reconstitutes the original NaN + value (for many applications, the single NaN encoding 0xf97e00 will + suffice). + + Definite-length encoding is preferred whenever the length is known at + the time the serialization of the item starts. + +4.2. Deterministically Encoded CBOR + + Some protocols may want encoders to only emit CBOR in a particular + deterministic format; those protocols might also have the decoders + check that their input is in that deterministic format. Those + protocols are free to define what they mean by a "deterministic + format" and what encoders and decoders are expected to do. This + section defines a set of restrictions that can serve as the base of + such a deterministic format. + +4.2.1. Core Deterministic Encoding Requirements + + A CBOR encoding satisfies the "core deterministic encoding + requirements" if it satisfies the following restrictions: + + * Preferred serialization MUST be used. In particular, this means + that arguments (see Section 3) for integers, lengths in major + types 2 through 5, and tags MUST be as short as possible, for + instance: + + - 0 to 23 and -1 to -24 MUST be expressed in the same byte as the + major type; + + - 24 to 255 and -25 to -256 MUST be expressed only with an + additional uint8_t; + + - 256 to 65535 and -257 to -65536 MUST be expressed only with an + additional uint16_t; + + - 65536 to 4294967295 and -65537 to -4294967296 MUST be expressed + only with an additional uint32_t. + + Floating-point values also MUST use the shortest form that + preserves the value, e.g., 1.5 is encoded as 0xf93e00 (binary16) + and 1000000.5 as 0xfa49742408 (binary32). (One implementation of + this is to have all floats start as a 64-bit float, then do a test + conversion to a 32-bit float; if the result is the same numeric + value, use the shorter form and repeat the process with a test + conversion to a 16-bit float. This also works to select 16-bit + float for positive and negative Infinity as well.) + + * Indefinite-length items MUST NOT appear. They can be encoded as + definite-length items instead. + + * The keys in every map MUST be sorted in the bytewise lexicographic + order of their deterministic encodings. For example, the + following keys are sorted correctly: + + 1. 10, encoded as 0x0a. + + 2. 100, encoded as 0x1864. + + 3. -1, encoded as 0x20. + + 4. "z", encoded as 0x617a. + + 5. "aa", encoded as 0x626161. + + 6. [100], encoded as 0x811864. + + 7. [-1], encoded as 0x8120. + + 8. false, encoded as 0xf4. + + | Implementation note: the self-delimiting nature of the CBOR + | encoding means that there are no two well-formed CBOR encoded + | data items where one is a prefix of the other. The bytewise + | lexicographic comparison of deterministic encodings of + | different map keys therefore always ends in a position where + | the byte differs between the keys, before the end of a key is + | reached. + +4.2.2. Additional Deterministic Encoding Considerations + + CBOR tags present additional considerations for deterministic + encoding. If a CBOR-based protocol were to provide the same + semantics for the presence and absence of a specific tag (e.g., by + allowing both tag 1 data items and raw numbers in a date/time + position, treating the latter as if they were tagged), the + deterministic format would not allow the presence of the tag, based + on the "shortest form" principle. For example, a protocol might give + encoders the choice of representing a URL as either a text string or, + using Section 3.4.5.3, tag number 32 containing a text string. This + protocol's deterministic encoding needs either to require that the + tag is present or to require that it is absent, not allow either one. + + In a protocol that does require tags in certain places to obtain + specific semantics, the tag needs to appear in the deterministic + format as well. Deterministic encoding considerations also apply to + the content of tags. + + If a protocol includes a field that can express integers with an + absolute value of 2^(64) or larger using tag numbers 2 or 3 + (Section 3.4.3), the protocol's deterministic encoding needs to + specify whether smaller integers are also expressed using these tags + or using major types 0 and 1. Preferred serialization uses the + latter choice, which is therefore recommended. + + Protocols that include floating-point values, whether represented + using basic floating-point values (Section 3.3) or using tags (or + both), may need to define extra requirements on their deterministic + encodings, such as: + + * Although IEEE floating-point values can represent both positive + and negative zero as distinct values, the application might not + distinguish these and might decide to represent all zero values + with a positive sign, disallowing negative zero. (The application + may also want to restrict the precision of floating-point values + in such a way that there is never a need to represent 64-bit -- or + even 32-bit -- floating-point values.) + + * If a protocol includes a field that can express floating-point + values, with a specific data model that declares integer and + floating-point values to be interchangeable, the protocol's + deterministic encoding needs to specify whether, for example, the + integer 1.0 is encoded as 0x01 (unsigned integer), 0xf93c00 + (binary16), 0xfa3f800000 (binary32), or 0xfb3ff0000000000000 + (binary64). Example rules for this are: + + 1. Encode integral values that fit in 64 bits as values from + major types 0 and 1, and other values as the preferred + (smallest of 16-, 32-, or 64-bit) floating-point + representation that accurately represents the value, + + 2. Encode all values as the preferred floating-point + representation that accurately represents the value, even for + integral values, or + + 3. Encode all values as 64-bit floating-point representations. + + Rule 1 straddles the boundaries between integers and floating- + point values, and Rule 3 does not use preferred serialization, so + Rule 2 may be a good choice in many cases. + + * If NaN is an allowed value, and there is no intent to support NaN + payloads or signaling NaNs, the protocol needs to pick a single + representation, typically 0xf97e00. If that simple choice is not + possible, specific attention will be needed for NaN handling. + + * Subnormal numbers (nonzero numbers with the lowest possible + exponent of a given IEEE 754 number format) may be flushed to zero + outputs or be treated as zero inputs in some floating-point + implementations. A protocol's deterministic encoding may want to + specifically accommodate such implementations while creating an + onus on other implementations by excluding subnormal numbers from + interchange, interchanging zero instead. + + * The same number can be represented by different decimal fractions, + by different bigfloats, and by different forms under other tags + that may be defined to express numeric values. Depending on the + implementation, it may not always be practical to determine + whether any of these forms (or forms in the basic generic data + model) are equivalent. An application protocol that presents + choices of this kind for the representation format of numbers + needs to be explicit about how the formats for deterministic + encoding are to be chosen. + +4.2.3. Length-First Map Key Ordering + + The core deterministic encoding requirements (Section 4.2.1) sort map + keys in a different order from the one suggested by Section 3.9 of + [RFC7049] (called "Canonical CBOR" there). Protocols that need to be + compatible with the order specified in [RFC7049] can instead be + specified in terms of this specification's "length-first core + deterministic encoding requirements": + + A CBOR encoding satisfies the "length-first core deterministic + encoding requirements" if it satisfies the core deterministic + encoding requirements except that the keys in every map MUST be + sorted such that: + + 1. If two keys have different lengths, the shorter one sorts + earlier; + + 2. If two keys have the same length, the one with the lower value in + (bytewise) lexical order sorts earlier. + + For example, under the length-first core deterministic encoding + requirements, the following keys are sorted correctly: + + 1. 10, encoded as 0x0a. + + 2. -1, encoded as 0x20. + + 3. false, encoded as 0xf4. + + 4. 100, encoded as 0x1864. + + 5. "z", encoded as 0x617a. + + 6. [-1], encoded as 0x8120. + + 7. "aa", encoded as 0x626161. + + 8. [100], encoded as 0x811864. + + | Although [RFC7049] used the term "Canonical CBOR" for its form + | of requirements on deterministic encoding, this document avoids + | this term because "canonicalization" is often associated with + | specific uses of deterministic encoding only. The terms are + | essentially interchangeable, however, and the set of core + | requirements in this document could also be called "Canonical + | CBOR", while the length-first-ordered version of that could be + | called "Old Canonical CBOR". + +5. Creating CBOR-Based Protocols + + Data formats such as CBOR are often used in environments where there + is no format negotiation. A specific design goal of CBOR is to not + need any included or assumed schema: a decoder can take a CBOR item + and decode it with no other knowledge. + + Of course, in real-world implementations, the encoder and the decoder + will have a shared view of what should be in a CBOR data item. For + example, an agreed-to format might be "the item is an array whose + first value is a UTF-8 string, second value is an integer, and + subsequent values are zero or more floating-point numbers" or "the + item is a map that has byte strings for keys and contains a pair + whose key is 0xab01". + + CBOR-based protocols MUST specify how their decoders handle invalid + and other unexpected data. CBOR-based protocols MAY specify that + they treat arbitrary valid data as unexpected. Encoders for CBOR- + based protocols MUST produce only valid items, that is, the protocol + cannot be designed to make use of invalid items. An encoder can be + capable of encoding as many or as few types of values as is required + by the protocol in which it is used; a decoder can be capable of + understanding as many or as few types of values as is required by the + protocols in which it is used. This lack of restrictions allows CBOR + to be used in extremely constrained environments. + + The rest of this section discusses some considerations in creating + CBOR-based protocols. With few exceptions, it is advisory only and + explicitly excludes any language from BCP 14 [RFC2119] [RFC8174] + other than words that could be interpreted as "MAY" in the sense of + BCP 14. The exceptions aim at facilitating interoperability of CBOR- + based protocols while making use of a wide variety of both generic + and application-specific encoders and decoders. + +5.1. CBOR in Streaming Applications + + In a streaming application, a data stream may be composed of a + sequence of CBOR data items concatenated back-to-back. In such an + environment, the decoder immediately begins decoding a new data item + if data is found after the end of a previous data item. + + Not all of the bytes making up a data item may be immediately + available to the decoder; some decoders will buffer additional data + until a complete data item can be presented to the application. + Other decoders can present partial information about a top-level data + item to an application, such as the nested data items that could + already be decoded, or even parts of a byte string that hasn't + completely arrived yet. Such an application also MUST have a + matching streaming security mechanism, where the desired protection + is available for incremental data presented to the application. + + Note that some applications and protocols will not want to use + indefinite-length encoding. Using indefinite-length encoding allows + an encoder to not need to marshal all the data for counting, but it + requires a decoder to allocate increasing amounts of memory while + waiting for the end of the item. This might be fine for some + applications but not others. + +5.2. Generic Encoders and Decoders + + A generic CBOR decoder can decode all well-formed encoded CBOR data + items and present the data items to an application. See Appendix C. + (The diagnostic notation, Section 8, may be used to present well- + formed CBOR values to humans.) + + Generic CBOR encoders provide an application interface that allows + the application to specify any well-formed value to be encoded as a + CBOR data item, including simple values and tags unknown to the + encoder. + + Even though CBOR attempts to minimize these cases, not all well- + formed CBOR data is valid: for example, the encoded text string + "0x62c0ae" does not contain valid UTF-8 (because [RFC3629] requires + always using the shortest form) and so is not a valid CBOR item. + Also, specific tags may make semantic constraints that may be + violated, for instance, by a bignum tag enclosing another tag or by + an instance of tag number 0 containing a byte string or containing a + text string with contents that do not match the "date-time" + production of [RFC3339]. There is no requirement that generic + encoders and decoders make unnatural choices for their application + interface to enable the processing of invalid data. Generic encoders + and decoders are expected to forward simple values and tags even if + their specific codepoints are not registered at the time the encoder/ + decoder is written (Section 5.4). + +5.3. Validity of Items + + A well-formed but invalid CBOR data item (Section 1.2) presents a + problem with interpreting the data encoded in it in the CBOR data + model. A CBOR-based protocol could be specified in several layers, + in which the lower layers don't process the semantics of some of the + CBOR data they forward. These layers can't notice any validity + errors in data they don't process and MUST forward that data as-is. + The first layer that does process the semantics of an invalid CBOR + item MUST pick one of two choices: + + 1. Replace the problematic item with an error marker and continue + with the next item, or + + 2. Issue an error and stop processing altogether. + + A CBOR-based protocol MUST specify which of these options its + decoders take for each kind of invalid item they might encounter. + + Such problems might occur at the basic validity level of CBOR or in + the context of tags (tag validity). + +5.3.1. Basic validity + + Two kinds of validity errors can occur in the basic generic data + model: + + Duplicate keys in a map: Generic decoders (Section 5.2) make data + available to applications using the native CBOR data model. That + data model includes maps (key-value mappings with unique keys), + not multimaps (key-value mappings where multiple entries can have + the same key). Thus, a generic decoder that gets a CBOR map item + that has duplicate keys will decode to a map with only one + instance of that key, or it might stop processing altogether. On + the other hand, a "streaming decoder" may not even be able to + notice. See Section 5.6 for more discussion of keys in maps. + + Invalid UTF-8 string: A decoder might or might not want to verify + that the sequence of bytes in a UTF-8 string (major type 3) is + actually valid UTF-8 and react appropriately. + +5.3.2. Tag validity + + Two additional kinds of validity errors are introduced by adding tags + to the basic generic data model: + + Inadmissible type for tag content: Tag numbers (Section 3.4) specify + what type of data item is supposed to be used as their tag + content; for example, the tag numbers for unsigned or negative + bignums are supposed to be put on byte strings. A decoder that + decodes the tagged data item into a native representation (a + native big integer in this example) is expected to check the type + of the data item being tagged. Even decoders that don't have such + native representations available in their environment may perform + the check on those tags known to them and react appropriately. + + Inadmissible value for tag content: The type of data item may be + admissible for a tag's content, but the specific value may not be; + e.g., a value of "yesterday" is not acceptable for the content of + tag 0, even though it properly is a text string. A decoder that + normally ingests such tags into equivalent platform types might + present this tag to the application in a similar way to how it + would present a tag with an unknown tag number (Section 5.4). + +5.4. Validity and Evolution + + A decoder with validity checking will expend the effort to reliably + detect data items with validity errors. For example, such a decoder + needs to have an API that reports an error (and does not return data) + for a CBOR data item that contains any of the validity errors listed + in the previous subsection. + + The set of tags defined in the "Concise Binary Object Representation + (CBOR) Tags" registry (Section 9.2), as well as the set of simple + values defined in the "Concise Binary Object Representation (CBOR) + Simple Values" registry (Section 9.1), can grow at any time beyond + the set understood by a generic decoder. A validity-checking decoder + can do one of two things when it encounters such a case that it does + not recognize: + + * It can report an error (and not return data). Note that treating + this case as an error can cause ossification and is thus not + encouraged. This error is not a validity error, per se. This + kind of error is more likely to be raised by a decoder that would + be performing validity checking if this were a known case. + + * It can emit the unknown item (type, value, and, for tags, the + decoded tagged data item) to the application calling the decoder, + and then give the application an indication that the decoder did + not recognize that tag number or simple value. + + The latter approach, which is also appropriate for decoders that do + not support validity checking, provides forward compatibility with + newly registered tags and simple values without the requirement to + update the encoder at the same time as the calling application. (For + this, the decoder's API needs the ability to mark unknown items so + that the calling application can handle them in a manner appropriate + for the program.) + + Since some of the processing needed for validity checking may have an + appreciable cost (in particular with duplicate detection for maps), + support of validity checking is not a requirement placed on all CBOR + decoders. + + Some encoders will rely on their applications to provide input data + in such a way that valid CBOR results from the encoder. A generic + encoder may also want to provide a validity-checking mode where it + reliably limits its output to valid CBOR, independent of whether or + not its application is indeed providing API-conformant data. + +5.5. Numbers + + CBOR-based protocols should take into account that different language + environments pose different restrictions on the range and precision + of numbers that are representable. For example, the basic JavaScript + number system treats all numbers as floating-point values, which may + result in the silent loss of precision in decoding integers with more + than 53 significant bits. Another example is that, since CBOR keeps + the sign bit for its integer representation in the major type, it has + one bit more for signed numbers of a certain length (e.g., + -2^(64)..2^(64)-1 for 1+8-byte integers) than the typical platform + signed integer representation of the same length (-2^(63)..2^(63)-1 + for 8-byte int64_t). A protocol that uses numbers should define its + expectations on the handling of nontrivial numbers in decoders and + receiving applications. + + A CBOR-based protocol that includes floating-point numbers can + restrict which of the three formats (half-precision, single- + precision, and double-precision) are to be supported. For an + integer-only application, a protocol may want to completely exclude + the use of floating-point values. + + A CBOR-based protocol designed for compactness may want to exclude + specific integer encodings that are longer than necessary for the + application, such as to save the need to implement 64-bit integers. + There is an expectation that encoders will use the most compact + integer representation that can represent a given value. However, a + compact application that does not require deterministic encoding + should accept values that use a longer-than-needed encoding (such as + encoding "0" as 0b000_11001 followed by two bytes of 0x00) as long as + the application can decode an integer of the given size. Similar + considerations apply to floating-point values; decoding both + preferred serializations and longer-than-needed ones is recommended. + + CBOR-based protocols for constrained applications that provide a + choice between representing a specific number as an integer and as a + decimal fraction or bigfloat (such as when the exponent is small and + nonnegative) might express a quality-of-implementation expectation + that the integer representation is used directly. + +5.6. Specifying Keys for Maps + + The encoding and decoding applications need to agree on what types of + keys are going to be used in maps. In applications that need to + interwork with JSON-based applications, conversion is simplified by + limiting keys to text strings only; otherwise, there has to be a + specified mapping from the other CBOR types to text strings, and this + often leads to implementation errors. In applications where keys are + numeric in nature, and numeric ordering of keys is important to the + application, directly using the numbers for the keys is useful. + + If multiple types of keys are to be used, consideration should be + given to how these types would be represented in the specific + programming environments that are to be used. For example, in + JavaScript Maps [ECMA262], a key of integer 1 cannot be distinguished + from a key of floating-point 1.0. This means that, if integer keys + are used, the protocol needs to avoid the use of floating-point keys + the values of which happen to be integer numbers in the same map. + + Decoders that deliver data items nested within a CBOR data item + immediately on decoding them ("streaming decoders") often do not keep + the state that is necessary to ascertain uniqueness of a key in a + map. Similarly, an encoder that can start encoding data items before + the enclosing data item is completely available ("streaming encoder") + may want to reduce its overhead significantly by relying on its data + source to maintain uniqueness. + + A CBOR-based protocol MUST define what to do when a receiving + application sees multiple identical keys in a map. The resulting + rule in the protocol MUST respect the CBOR data model: it cannot + prescribe a specific handling of the entries with the identical keys, + except that it might have a rule that having identical keys in a map + indicates a malformed map and that the decoder has to stop with an + error. When processing maps that exhibit entries with duplicate + keys, a generic decoder might do one of the following: + + * Not accept maps with duplicate keys (that is, enforce validity for + maps, see also Section 5.4). These generic decoders are + universally useful. An application may still need to perform its + own duplicate checking based on application rules (for instance, + if the application equates integers and floating-point values in + map key positions for specific maps). + + * Pass all map entries to the application, including ones with + duplicate keys. This requires that the application handle (check + against) duplicate keys, even if the application rules are + identical to the generic data model rules. + + * Lose some entries with duplicate keys, e.g., deliver only the + final (or first) entry out of the entries with the same key. With + such a generic decoder, applications may get different results for + a specific key on different runs, and with different generic + decoders, which value is returned is based on generic decoder + implementation and the actual order of keys in the map. In + particular, applications cannot validate key uniqueness on their + own as they do not necessarily see all entries; they may not be + able to use such a generic decoder if they need to validate key + uniqueness. These generic decoders can only be used in situations + where the data source and transfer always provide valid maps; this + is not possible if the data source and transfer can be attacked. + + Generic decoders need to document which of these three approaches + they implement. + + The CBOR data model for maps does not allow ascribing semantics to + the order of the key/value pairs in the map representation. Thus, a + CBOR-based protocol MUST NOT specify that changing the key/value pair + order in a map changes the semantics, except to specify that some + orders are disallowed, for example, where they would not meet the + requirements of a deterministic encoding (Section 4.2). (Any + secondary effects of map ordering such as on timing, cache usage, and + other potential side channels are not considered part of the + semantics but may be enough reason on their own for a protocol to + require a deterministic encoding format.) + + Applications for constrained devices should consider using small + integers as keys if they have maps with a small number of frequently + used keys; for instance, a set of 24 or fewer keys can be encoded in + a single byte as unsigned integers, up to 48 if negative integers are + also used. Less frequently occurring keys can then use integers with + longer encodings. + +5.6.1. Equivalence of Keys + + The specific data model that applies to a CBOR data item is used to + determine whether keys occurring in maps are duplicates or distinct. + + At the generic data model level, numerically equivalent integer and + floating-point values are distinct from each other, as they are from + the various big numbers (Tags 2 to 5). Similarly, text strings are + distinct from byte strings, even if composed of the same bytes. A + tagged value is distinct from an untagged value or from a value + tagged with a different tag number. + + Within each of these groups, numeric values are distinct unless they + are numerically equal (specifically, -0.0 is equal to 0.0); for the + purpose of map key equivalence, NaN values are equivalent if they + have the same significand after zero-extending both significands at + the right to 64 bits. + + Both byte strings and text strings are compared byte by byte, arrays + are compared element by element, and are equal if they have the same + number of bytes/elements and the same values at the same positions. + Two maps are equal if they have the same set of pairs regardless of + their order; pairs are equal if both the key and value are equal. + + Tagged values are equal if both the tag number and the tag content + are equal. (Note that a generic decoder that provides processing for + a specific tag may not be able to distinguish some semantically + equivalent values, e.g., if leading zeroes occur in the content of + tag 2 or tag 3 (Section 3.4.3).) Simple values are equal if they + simply have the same value. Nothing else is equal in the generic + data model; a simple value 2 is not equivalent to an integer 2, and + an array is never equivalent to a map. + + As discussed in Section 2.2, specific data models can make values + equivalent for the purpose of comparing map keys that are distinct in + the generic data model. Note that this implies that a generic + decoder may deliver a decoded map to an application that needs to be + checked for duplicate map keys by that application (alternatively, + the decoder may provide a programming interface to perform this + service for the application). Specific data models are not able to + distinguish values for map keys that are equal for this purpose at + the generic data model level. + +5.7. Undefined Values + + In some CBOR-based protocols, the simple value (Section 3.3) of + "undefined" might be used by an encoder as a substitute for a data + item with an encoding problem, in order to allow the rest of the + enclosing data items to be encoded without harm. + +6. Converting Data between CBOR and JSON + + This section gives non-normative advice about converting between CBOR + and JSON. Implementations of converters MAY use whichever advice + here they want. + + It is worth noting that a JSON text is a sequence of characters, not + an encoded sequence of bytes, while a CBOR data item consists of + bytes, not characters. + +6.1. Converting from CBOR to JSON + + Most of the types in CBOR have direct analogs in JSON. However, some + do not, and someone implementing a CBOR-to-JSON converter has to + consider what to do in those cases. The following non-normative + advice deals with these by converting them to a single substitute + value, such as a JSON null. + + * An integer (major type 0 or 1) becomes a JSON number. + + * A byte string (major type 2) that is not embedded in a tag that + specifies a proposed encoding is encoded in base64url without + padding and becomes a JSON string. + + * A UTF-8 string (major type 3) becomes a JSON string. Note that + JSON requires escaping certain characters ([RFC8259], Section 7): + quotation mark (U+0022), reverse solidus (U+005C), and the "C0 + control characters" (U+0000 through U+001F). All other characters + are copied unchanged into the JSON UTF-8 string. + + * An array (major type 4) becomes a JSON array. + + * A map (major type 5) becomes a JSON object. This is possible + directly only if all keys are UTF-8 strings. A converter might + also convert other keys into UTF-8 strings (such as by converting + integers into strings containing their decimal representation); + however, doing so introduces a danger of key collision. Note also + that, if tags on UTF-8 strings are ignored as proposed below, this + will cause a key collision if the tags are different but the + strings are the same. + + * False (major type 7, additional information 20) becomes a JSON + false. + + * True (major type 7, additional information 21) becomes a JSON + true. + + * Null (major type 7, additional information 22) becomes a JSON + null. + + * A floating-point value (major type 7, additional information 25 + through 27) becomes a JSON number if it is finite (that is, it can + be represented in a JSON number); if the value is non-finite (NaN, + or positive or negative Infinity), it is represented by the + substitute value. + + * Any other simple value (major type 7, any additional information + value not yet discussed) is represented by the substitute value. + + * A bignum (major type 6, tag number 2 or 3) is represented by + encoding its byte string in base64url without padding and becomes + a JSON string. For tag number 3 (negative bignum), a "~" (ASCII + tilde) is inserted before the base-encoded value. (The conversion + to a binary blob instead of a number is to prevent a likely + numeric overflow for the JSON decoder.) + + * A byte string with an encoding hint (major type 6, tag number 21 + through 23) is encoded as described by the hint and becomes a JSON + string. + + * For all other tags (major type 6, any other tag number), the tag + content is represented as a JSON value; the tag number is ignored. + + * Indefinite-length items are made definite before conversion. + + A CBOR-to-JSON converter may want to keep to the JSON profile I-JSON + [RFC7493], to maximize interoperability and increase confidence that + the JSON output can be processed with predictable results. For + example, this has implications on the range of integers that can be + represented reliably, as well as on the top-level items that may be + supported by older JSON implementations. + +6.2. Converting from JSON to CBOR + + All JSON values, once decoded, directly map into one or more CBOR + values. As with any kind of CBOR generation, decisions have to be + made with respect to number representation. In a suggested + conversion: + + * JSON numbers without fractional parts (integer numbers) are + represented as integers (major types 0 and 1, possibly major type + 6, tag number 2 and 3), choosing the shortest form; integers + longer than an implementation-defined threshold may instead be + represented as floating-point values. The default range that is + represented as integer is -2^(53)+1..2^(53)-1 (fully exploiting + the range for exact integers in the binary64 representation often + used for decoding JSON [RFC7493]). A CBOR-based protocol, or a + generic converter implementation, may choose -2^(32)..2^(32)-1 or + -2^(64)..2^(64)-1 (fully using the integer ranges available in + CBOR with uint32_t or uint64_t, respectively) or even + -2^(31)..2^(31)-1 or -2^(63)..2^(63)-1 (using popular ranges for + two's complement signed integers). (If the JSON was generated + from a JavaScript implementation, its precision is already limited + to 53 bits maximum.) + + * Numbers with fractional parts are represented as floating-point + values, performing the decimal-to-binary conversion based on the + precision provided by IEEE 754 binary64. The mathematical value + of the JSON number is converted to binary64 using the + roundTiesToEven procedure in Section 4.3.1 of [IEEE754]. Then, + when encoding in CBOR, the preferred serialization uses the + shortest floating-point representation exactly representing this + conversion result; for instance, 1.5 is represented in a 16-bit + floating-point value (not all implementations will be capable of + efficiently finding the minimum form, though). Instead of using + the default binary64 precision, there may be an implementation- + defined limit to the precision of the conversion that will affect + the precision of the represented values. Decimal representation + should only be used on the CBOR side if that is specified in a + protocol. + + CBOR has been designed to generally provide a more compact encoding + than JSON. One implementation strategy that might come to mind is to + perform a JSON-to-CBOR encoding in place in a single buffer. This + strategy would need to carefully consider a number of pathological + cases, such as that some strings represented with no or very few + escapes and longer (or much longer) than 255 bytes may expand when + encoded as UTF-8 strings in CBOR. Similarly, a few of the binary + floating-point representations might cause expansion from some short + decimal representations (1.1, 1e9) in JSON. This may be hard to get + right, and any ensuing vulnerabilities may be exploited by an + attacker. + +7. Future Evolution of CBOR + + Successful protocols evolve over time. New ideas appear, + implementation platforms improve, related protocols are developed and + evolve, and new requirements from applications and protocols are + added. Facilitating protocol evolution is therefore an important + design consideration for any protocol development. + + For protocols that will use CBOR, CBOR provides some useful + mechanisms to facilitate their evolution. Best practices for this + are well known, particularly from JSON format development of JSON- + based protocols. Therefore, such best practices are outside the + scope of this specification. + + However, facilitating the evolution of CBOR itself is very well + within its scope. CBOR is designed to both provide a stable basis + for development of CBOR-based protocols and to be able to evolve. + Since a successful protocol may live for decades, CBOR needs to be + designed for decades of use and evolution. This section provides + some guidance for the evolution of CBOR. It is necessarily more + subjective than other parts of this document. It is also necessarily + incomplete, lest it turn into a textbook on protocol development. + +7.1. Extension Points + + In a protocol design, opportunities for evolution are often included + in the form of extension points. For example, there may be a + codepoint space that is not fully allocated from the outset, and the + protocol is designed to tolerate and embrace implementations that + start using more codepoints than initially allocated. + + Sizing the codepoint space may be difficult because the range + required may be hard to predict. Protocol designs should attempt to + make the codepoint space large enough so that it can slowly be filled + over the intended lifetime of the protocol. + + CBOR has three major extension points: + + the "simple" space (values in major type 7): Of the 24 efficient + (and 224 slightly less efficient) values, only a small number have + been allocated. Implementations receiving an unknown simple data + item may easily be able to process it as such, given that the + structure of the value is indeed simple. The IANA registry in + Section 9.1 is the appropriate way to address the extensibility of + this codepoint space. + + the "tag" space (values in major type 6): The total codepoint space + is abundant; only a tiny part of it has been allocated. However, + not all of these codepoints are equally efficient: the first 24 + only consume a single ("1+0") byte, and half of them have already + been allocated. The next 232 values only consume two ("1+1") + bytes, with nearly a quarter already allocated. These subspaces + need some curation to last for a few more decades. + Implementations receiving an unknown tag number can choose to + process just the enclosed tag content or, preferably, to process + the tag as an unknown tag number wrapping the tag content. The + IANA registry in Section 9.2 is the appropriate way to address the + extensibility of this codepoint space. + + the "additional information" space: An implementation receiving an + unknown additional information value has no way to continue + decoding, so allocating codepoints in this space is a major step + beyond just exercising an extension point. There are also very + few codepoints left. See also Section 7.2. + +7.2. Curating the Additional Information Space + + The human mind is sometimes drawn to filling in little perceived gaps + to make something neat. We expect the remaining gaps in the + codepoint space for the additional information values to be an + attractor for new ideas, just because they are there. + + The present specification does not manage the additional information + codepoint space by an IANA registry. Instead, allocations out of + this space can only be done by updating this specification. + + For an additional information value of n >= 24, the size of the + additional data typically is 2^(n-24) bytes. Therefore, additional + information values 28 and 29 should be viewed as candidates for + 128-bit and 256-bit quantities, in case a need arises to add them to + the protocol. Additional information value 30 is then the only + additional information value available for general allocation, and + there should be a very good reason for allocating it before assigning + it through an update of the present specification. + +8. Diagnostic Notation + + CBOR is a binary interchange format. To facilitate documentation and + debugging, and in particular to facilitate communication between + entities cooperating in debugging, this section defines a simple + human-readable diagnostic notation. All actual interchange always + happens in the binary format. + + Note that this truly is a diagnostic format; it is not meant to be + parsed. Therefore, no formal definition (as in ABNF) is given in + this document. (Implementers looking for a text-based format for + representing CBOR data items in configuration files may also want to + consider YAML [YAML].) + + The diagnostic notation is loosely based on JSON as it is defined in + RFC 8259, extending it where needed. + + The notation borrows the JSON syntax for numbers (integer and + floating-point), True (>true<), False (>false<), Null (>null<), UTF-8 + strings, arrays, and maps (maps are called objects in JSON; the + diagnostic notation extends JSON here by allowing any data item in + the key position). Undefined is written >undefined< as in + JavaScript. The non-finite floating-point numbers Infinity, + -Infinity, and NaN are written exactly as in this sentence (this is + also a way they can be written in JavaScript, although JSON does not + allow them). A tag is written as an integer number for the tag + number, followed by the tag content in parentheses; for instance, a + date in the format specified by RFC 3339 (ISO 8601) could be notated + as: + + 0("2013-03-21T20:04:00Z") + + or the equivalent relative time as the following: + + 1(1363896240) + + Byte strings are notated in one of the base encodings, without + padding, enclosed in single quotes, prefixed by >h< for base16, >b32< + for base32, >h32< for base32hex, >b64< for base64 or base64url (the + actual encodings do not overlap, so the string remains unambiguous). + For example, the byte string 0x12345678 could be written h'12345678', + b32'CI2FM6A', or b64'EjRWeA'. + + Unassigned simple values are given as "simple()" with the appropriate + integer in the parentheses. For example, "simple(42)" indicates + major type 7, value 42. + + A number of useful extensions to the diagnostic notation defined here + are provided in Appendix G of [RFC8610], "Extended Diagnostic + Notation" (EDN). Similarly, this notation could be extended in a + separate document to provide documentation for NaN payloads, which + are not covered in this document. + +8.1. Encoding Indicators + + Sometimes it is useful to indicate in the diagnostic notation which + of several alternative representations were actually used; for + example, a data item written >1.5< by a diagnostic decoder might have + been encoded as a half-, single-, or double-precision float. + + The convention for encoding indicators is that anything starting with + an underscore and all following characters that are alphanumeric or + underscore is an encoding indicator, and can be ignored by anyone not + interested in this information. For example, "_" or "_3". Encoding + indicators are always optional. + + A single underscore can be written after the opening brace of a map + or the opening bracket of an array to indicate that the data item was + represented in indefinite-length format. For example, [_ 1, 2] + contains an indicator that an indefinite-length representation was + used to represent the data item [1, 2]. + + An underscore followed by a decimal digit n indicates that the + preceding item (or, for arrays and maps, the item starting with the + preceding bracket or brace) was encoded with an additional + information value of 24+n. For example, 1.5_1 is a half-precision + floating-point number, while 1.5_3 is encoded as double precision. + This encoding indicator is not shown in Appendix A. (Note that the + encoding indicator "_" is thus an abbreviation of the full form "_7", + which is not used.) + + The detailed chunk structure of byte and text strings of indefinite + length can be notated in the form (_ h'0123', h'4567') and (_ "foo", + "bar"). However, for an indefinite-length string with no chunks + inside, (_ ) would be ambiguous as to whether a byte string (0x5fff) + or a text string (0x7fff) is meant and is therefore not used. The + basic forms ''_ and ""_ can be used instead and are reserved for the + case of no chunks only -- not as short forms for the (permitted, but + not really useful) encodings with only empty chunks, which need to be + notated as (_ ''), (_ ""), etc., to preserve the chunk structure. + +9. IANA Considerations + + IANA has created two registries for new CBOR values. The registries + are separate, that is, not under an umbrella registry, and follow the + rules in [RFC8126]. IANA has also assigned a new media type, an + associated CoAP Content-Format entry, and a structured syntax suffix. + +9.1. CBOR Simple Values Registry + + IANA has created the "Concise Binary Object Representation (CBOR) + Simple Values" registry at [IANA.cbor-simple-values]. The initial + values are shown in Table 4. + + New entries in the range 0 to 19 are assigned by Standards Action + [RFC8126]. It is suggested that IANA allocate values starting with + the number 16 in order to reserve the lower numbers for contiguous + blocks (if any). + + New entries in the range 32 to 255 are assigned by Specification + Required. + +9.2. CBOR Tags Registry + + IANA has created the "Concise Binary Object Representation (CBOR) + Tags" registry at [IANA.cbor-tags]. The tags that were defined in + [RFC7049] are described in detail in Section 3.4, and other tags have + already been defined since then. + + New entries in the range 0 to 23 ("1+0") are assigned by Standards + Action. New entries in the ranges 24 to 255 ("1+1") and 256 to 32767 + (lower half of "1+2") are assigned by Specification Required. New + entries in the range 32768 to 18446744073709551615 (upper half of + "1+2", "1+4", and "1+8") are assigned by First Come First Served. + The template for registration requests is: + + * Data item + + * Semantics (short form) + + In addition, First Come First Served requests should include: + + * Point of contact + + * Description of semantics (URL) -- This description is optional; + the URL can point to something like an Internet-Draft or a web + page. + + Applicants exercising the First Come First Served range and making a + suggestion for a tag number that is not representable in 32 bits + (i.e., larger than 4294967295) should be aware that this could reduce + interoperability with implementations that do not support 64-bit + numbers. + +9.3. Media Types Registry + + The Internet media type [RFC6838] ("MIME type") for a single encoded + CBOR data item is "application/cbor" as defined in the "Media Types" + registry [IANA.media-types]: + + Type name: application + + Subtype name: cbor + + Required parameters: n/a + + Optional parameters: n/a + + Encoding considerations: Binary + + Security considerations: See Section 10 of RFC 8949. + + Interoperability considerations: n/a + + Published specification: RFC 8949 + + Applications that use this media type: Many + + Additional information: + + Magic number(s): n/a + File extension(s): .cbor + Macintosh file type code(s): n/a + + Person & email address to contact for further information: IETF CBOR + Working Group (cbor@ietf.org) or IETF Applications and Real-Time + Area (art@ietf.org) + + Intended usage: COMMON + + Restrictions on usage: none + + Author: IETF CBOR Working Group (cbor@ietf.org) + + Change controller: The IESG (iesg@ietf.org) + +9.4. CoAP Content-Format Registry + + The CoAP Content-Format for CBOR has been registered in the "CoAP + Content-Formats" subregistry within the "Constrained RESTful + Environments (CoRE) Parameters" registry [IANA.core-parameters]: + + Media Type: application/cbor + + Encoding: - + + ID: 60 + + Reference: RFC 8949 + +9.5. Structured Syntax Suffix Registry + + The structured syntax suffix [RFC6838] for media types based on a + single encoded CBOR data item is +cbor, which IANA has registered in + the "Structured Syntax Suffixes" registry [IANA.structured-suffix]: + + Name: Concise Binary Object Representation (CBOR) + + +suffix: +cbor + + References: RFC 8949 + + Encoding Considerations: CBOR is a binary format. + + Interoperability Considerations: n/a + + Fragment Identifier Considerations: The syntax and semantics of + fragment identifiers specified for +cbor SHOULD be as specified + for "application/cbor". (At publication of RFC 8949, there is no + fragment identification syntax defined for "application/cbor".) + + The syntax and semantics for fragment identifiers for a specific + "xxx/yyy+cbor" SHOULD be processed as follows: + + * For cases defined in +cbor, where the fragment identifier + resolves per the +cbor rules, then process as specified in + +cbor. + + * For cases defined in +cbor, where the fragment identifier does + not resolve per the +cbor rules, then process as specified in + "xxx/yyy+cbor". + + * For cases not defined in +cbor, then process as specified in + "xxx/yyy+cbor". + + Security Considerations: See Section 10 of RFC 8949. + + Contact: IETF CBOR Working Group (cbor@ietf.org) or IETF + Applications and Real-Time Area (art@ietf.org) + + Author/Change Controller: IETF + +10. Security Considerations + + A network-facing application can exhibit vulnerabilities in its + processing logic for incoming data. Complex parsers are well known + as a likely source of such vulnerabilities, such as the ability to + remotely crash a node, or even remotely execute arbitrary code on it. + CBOR attempts to narrow the opportunities for introducing such + vulnerabilities by reducing parser complexity, by giving the entire + range of encodable values a meaning where possible. + + Because CBOR decoders are often used as a first step in processing + unvalidated input, they need to be fully prepared for all types of + hostile input that may be designed to corrupt, overrun, or achieve + control of the system decoding the CBOR data item. A CBOR decoder + needs to assume that all input may be hostile even if it has been + checked by a firewall, has come over a secure channel such as TLS, is + encrypted or signed, or has come from some other source that is + presumed trusted. + + Section 4.1 gives examples of limitations in interoperability when + using a constrained CBOR decoder with input from a CBOR encoder that + uses a non-preferred serialization. When a single data item is + consumed both by such a constrained decoder and a full decoder, it + can lead to security issues that can be exploited by an attacker who + can inject or manipulate content. + + As discussed throughout this document, there are many values that can + be considered "equivalent" in some circumstances and "not equivalent" + in others. As just one example, the numeric value for the number + "one" might be expressed as an integer or a bignum. A system + interpreting CBOR input might accept either form for the number + "one", or might reject one (or both) forms. Such acceptance or + rejection can have security implications in the program that is using + the interpreted input. + + Hostile input may be constructed to overrun buffers, to overflow or + underflow integer arithmetic, or to cause other decoding disruption. + CBOR data items might have lengths or sizes that are intentionally + extremely large or too short. Resource exhaustion attacks might + attempt to lure a decoder into allocating very big data items + (strings, arrays, maps, or even arbitrary precision numbers) or + exhaust the stack depth by setting up deeply nested items. Decoders + need to have appropriate resource management to mitigate these + attacks. (Items for which very large sizes are given can also + attempt to exploit integer overflow vulnerabilities.) + + A CBOR decoder, by definition, only accepts well-formed CBOR; this is + the first step to its robustness. Input that is not well-formed CBOR + causes no further processing from the point where the lack of well- + formedness was detected. If possible, any data decoded up to this + point should have no impact on the application using the CBOR + decoder. + + In addition to ascertaining well-formedness, a CBOR decoder might + also perform validity checks on the CBOR data. Alternatively, it can + leave those checks to the application using the decoder. This choice + needs to be clearly documented in the decoder. Beyond the validity + at the CBOR level, an application also needs to ascertain that the + input is in alignment with the application protocol that is + serialized in CBOR. + + The input check itself may consume resources. This is usually linear + in the size of the input, which means that an attacker has to spend + resources that are commensurate to the resources spent by the + defender on input validation. However, an attacker might be able to + craft inputs that will take longer for a target decoder to process + than for the attacker to produce. Processing for arbitrary-precision + numbers may exceed linear effort. Also, some hash-table + implementations that are used by decoders to build in-memory + representations of maps can be attacked to spend quadratic effort, + unless a secret key (see Section 7 of [SIPHASH_LNCS], also + [SIPHASH_OPEN]) or some other mitigation is employed. Such + superlinear efforts can be exploited by an attacker to exhaust + resources at or before the input validator; they therefore need to be + avoided in a CBOR decoder implementation. Note that tag number + definitions and their implementations can add security considerations + of this kind; this should then be discussed in the security + considerations of the tag number definition. + + CBOR encoders do not receive input directly from the network and are + thus not directly attackable in the same way as CBOR decoders. + However, CBOR encoders often have an API that takes input from + another level in the implementation and can be attacked through that + API. The design and implementation of that API should assume the + behavior of its caller may be based on hostile input or on coding + mistakes. It should check inputs for buffer overruns, overflow and + underflow of integer arithmetic, and other such errors that are aimed + to disrupt the encoder. + + Protocols should be defined in such a way that potential multiple + interpretations are reliably reduced to a single interpretation. For + example, an attacker could make use of invalid input such as + duplicate keys in maps, or exploit different precision in processing + numbers to make one application base its decisions on a different + interpretation than the one that will be used by a second + application. To facilitate consistent interpretation, encoder and + decoder implementations should provide a validity-checking mode of + operation (Section 5.4). Note, however, that a generic decoder + cannot know about all requirements that an application poses on its + input data; it is therefore not relieving the application from + performing its own input checking. Also, since the set of defined + tag numbers evolves, the application may employ a tag number that is + not yet supported for validity checking by the generic decoder it + uses. Generic decoders therefore need to document which tag numbers + they support and what validity checking they provide for those tag + numbers as well as for basic CBOR (UTF-8 checking, duplicate map key + checking). + + Section 3.4.3 notes that using the non-preferred choice of a bignum + representation instead of a basic integer for encoding a number is + not intended to have application semantics, but it can have such + semantics if an application receiving CBOR data is using a decoder in + the basic generic data model. This disparity causes a security issue + if the two sets of semantics differ. Thus, applications using CBOR + need to specify the data model that they are using for each use of + CBOR data. + + It is common to convert CBOR data to other formats. In many cases, + CBOR has more expressive types than other formats; this is + particularly true for the common conversion to JSON. The loss of + type information can cause security issues for the systems that are + processing the less-expressive data. + + Section 6.2 describes a possibly common usage scenario of converting + between CBOR and JSON that could allow an attack if the attacker + knows that the application is performing the conversion. + + Security considerations for the use of base16 and base64 from + [RFC4648], and the use of UTF-8 from [RFC3629], are relevant to CBOR + as well. + +11. References + +11.1. Normative References + + [C] International Organization for Standardization, + "Information technology - Programming languages - C", + Fourth Edition, ISO/IEC 9899:2018, June 2018, + . + + [Cplusplus20] + International Organization for Standardization, + "Programming languages - C++", Sixth Edition, ISO/IEC DIS + 14882, ISO/IEC ISO/IEC JTC1 SC22 WG21 N 4860, March 2020, + . + + [IEEE754] IEEE, "IEEE Standard for Floating-Point Arithmetic", IEEE + Std 754-2019, DOI 10.1109/IEEESTD.2019.8766229, + . + + [RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail + Extensions (MIME) Part One: Format of Internet Message + Bodies", RFC 2045, DOI 10.17487/RFC2045, November 1996, + . + + [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate + Requirement Levels", BCP 14, RFC 2119, + DOI 10.17487/RFC2119, March 1997, + . + + [RFC3339] Klyne, G. and C. Newman, "Date and Time on the Internet: + Timestamps", RFC 3339, DOI 10.17487/RFC3339, July 2002, + . + + [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO + 10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November + 2003, . + + [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform + Resource Identifier (URI): Generic Syntax", STD 66, + RFC 3986, DOI 10.17487/RFC3986, January 2005, + . + + [RFC4287] Nottingham, M., Ed. and R. Sayre, Ed., "The Atom + Syndication Format", RFC 4287, DOI 10.17487/RFC4287, + December 2005, . + + [RFC4648] Josefsson, S., "The Base16, Base32, and Base64 Data + Encodings", RFC 4648, DOI 10.17487/RFC4648, October 2006, + . + + [RFC8126] Cotton, M., Leiba, B., and T. Narten, "Guidelines for + Writing an IANA Considerations Section in RFCs", BCP 26, + RFC 8126, DOI 10.17487/RFC8126, June 2017, + . + + [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC + 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, + May 2017, . + + [TIME_T] The Open Group, "The Open Group Base Specifications", + Section 4.16, 'Seconds Since the Epoch', Issue 7, 2018 + Edition, IEEE Std 1003.1, 2018, + . + +11.2. Informative References + + [ASN.1] International Telecommunication Union, "Information + Technology - ASN.1 encoding rules: Specification of Basic + Encoding Rules (BER), Canonical Encoding Rules (CER) and + Distinguished Encoding Rules (DER)", ITU-T Recommendation + X.690, 2015, + . + + [BSON] Various, "BSON - Binary JSON", . + + [CBOR-TAGS] + Bormann, C., "Notable CBOR Tags", Work in Progress, + Internet-Draft, draft-bormann-cbor-notable-tags-02, 25 + June 2020, . + + [ECMA262] Ecma International, "ECMAScript 2020 Language + Specification", Standard ECMA-262, 11th Edition, June + 2020, . + + [Err3764] RFC Errata, Erratum ID 3764, RFC 7049, + . + + [Err3770] RFC Errata, Erratum ID 3770, RFC 7049, + . + + [Err4294] RFC Errata, Erratum ID 4294, RFC 7049, + . + + [Err4409] RFC Errata, Erratum ID 4409, RFC 7049, + . + + [Err4963] RFC Errata, Erratum ID 4963, RFC 7049, + . + + [Err4964] RFC Errata, Erratum ID 4964, RFC 7049, + . + + [Err5434] RFC Errata, Erratum ID 5434, RFC 7049, + . + + [Err5763] RFC Errata, Erratum ID 5763, RFC 7049, + . + + [Err5917] RFC Errata, Erratum ID 5917, RFC 7049, + . + + [IANA.cbor-simple-values] + IANA, "Concise Binary Object Representation (CBOR) Simple + Values", + . + + [IANA.cbor-tags] + IANA, "Concise Binary Object Representation (CBOR) Tags", + . + + [IANA.core-parameters] + IANA, "Constrained RESTful Environments (CoRE) + Parameters", + . + + [IANA.media-types] + IANA, "Media Types", + . + + [IANA.structured-suffix] + IANA, "Structured Syntax Suffixes", + . + + [MessagePack] + Furuhashi, S., "MessagePack", . + + [PCRE] Hazel, P., "PCRE - Perl Compatible Regular Expressions", + . + + [RFC0713] Haverty, J., "MSDTP-Message Services Data Transmission + Protocol", RFC 713, DOI 10.17487/RFC0713, April 1976, + . + + [RFC6838] Freed, N., Klensin, J., and T. Hansen, "Media Type + Specifications and Registration Procedures", BCP 13, + RFC 6838, DOI 10.17487/RFC6838, January 2013, + . + + [RFC7049] Bormann, C. and P. Hoffman, "Concise Binary Object + Representation (CBOR)", RFC 7049, DOI 10.17487/RFC7049, + October 2013, . + + [RFC7228] Bormann, C., Ersue, M., and A. Keranen, "Terminology for + Constrained-Node Networks", RFC 7228, + DOI 10.17487/RFC7228, May 2014, + . + + [RFC7493] Bray, T., Ed., "The I-JSON Message Format", RFC 7493, + DOI 10.17487/RFC7493, March 2015, + . + + [RFC7991] Hoffman, P., "The "xml2rfc" Version 3 Vocabulary", + RFC 7991, DOI 10.17487/RFC7991, December 2016, + . + + [RFC8259] Bray, T., Ed., "The JavaScript Object Notation (JSON) Data + Interchange Format", STD 90, RFC 8259, + DOI 10.17487/RFC8259, December 2017, + . + + [RFC8610] Birkholz, H., Vigano, C., and C. Bormann, "Concise Data + Definition Language (CDDL): A Notational Convention to + Express Concise Binary Object Representation (CBOR) and + JSON Data Structures", RFC 8610, DOI 10.17487/RFC8610, + June 2019, . + + [RFC8618] Dickinson, J., Hague, J., Dickinson, S., Manderson, T., + and J. Bond, "Compacted-DNS (C-DNS): A Format for DNS + Packet Capture", RFC 8618, DOI 10.17487/RFC8618, September + 2019, . + + [RFC8742] Bormann, C., "Concise Binary Object Representation (CBOR) + Sequences", RFC 8742, DOI 10.17487/RFC8742, February 2020, + . + + [RFC8746] Bormann, C., Ed., "Concise Binary Object Representation + (CBOR) Tags for Typed Arrays", RFC 8746, + DOI 10.17487/RFC8746, February 2020, + . + + [SIPHASH_LNCS] + Aumasson, J. and D. Bernstein, "SipHash: A Fast Short- + Input PRF", Progress in Cryptology - INDOCRYPT 2012, pp. + 489-508, DOI 10.1007/978-3-642-34931-7_28, 2012, + . + + [SIPHASH_OPEN] + Aumasson, J. and D.J. Bernstein, "SipHash: a fast short- + input PRF", . + + [YAML] Ben-Kiki, O., Evans, C., and I.d. Net, "YAML Ain't Markup + Language (YAML[TM]) Version 1.2", 3rd Edition, October + 2009, . + +Appendix A. Examples of Encoded CBOR Data Items + + The following table provides some CBOR-encoded values in hexadecimal + (right column), together with diagnostic notation for these values + (left column). Note that the string "\u00fc" is one form of + diagnostic notation for a UTF-8 string containing the single Unicode + character U+00FC (LATIN SMALL LETTER U WITH DIAERESIS, "ü"). + Similarly, "\u6c34" is a UTF-8 string in diagnostic notation with a + single character U+6C34 (CJK UNIFIED IDEOGRAPH-6C34, "水"), often + representing "water", and "\ud800\udd51" is a UTF-8 string in + diagnostic notation with a single character U+10151 (GREEK ACROPHONIC + ATTIC FIFTY STATERS, "𐅑"). (Note that all these single-character + strings could also be represented in native UTF-8 in diagnostic + notation, just not if an ASCII-only specification is required.) In + the diagnostic notation provided for bignums, their intended numeric + value is shown as a decimal number (such as 18446744073709551616) + instead of a tagged byte string (such as 2(h'010000000000000000')). + + +==============================+====================================+ + |Diagnostic | Encoded | + +==============================+====================================+ + |0 | 0x00 | + +------------------------------+------------------------------------+ + |1 | 0x01 | + +------------------------------+------------------------------------+ + |10 | 0x0a | + +------------------------------+------------------------------------+ + |23 | 0x17 | + +------------------------------+------------------------------------+ + |24 | 0x1818 | + +------------------------------+------------------------------------+ + |25 | 0x1819 | + +------------------------------+------------------------------------+ + |100 | 0x1864 | + +------------------------------+------------------------------------+ + |1000 | 0x1903e8 | + +------------------------------+------------------------------------+ + |1000000 | 0x1a000f4240 | + +------------------------------+------------------------------------+ + |1000000000000 | 0x1b000000e8d4a51000 | + +------------------------------+------------------------------------+ + |18446744073709551615 | 0x1bffffffffffffffff | + +------------------------------+------------------------------------+ + |18446744073709551616 | 0xc249010000000000000000 | + +------------------------------+------------------------------------+ + |-18446744073709551616 | 0x3bffffffffffffffff | + +------------------------------+------------------------------------+ + |-18446744073709551617 | 0xc349010000000000000000 | + +------------------------------+------------------------------------+ + |-1 | 0x20 | + +------------------------------+------------------------------------+ + |-10 | 0x29 | + +------------------------------+------------------------------------+ + |-100 | 0x3863 | + +------------------------------+------------------------------------+ + |-1000 | 0x3903e7 | + +------------------------------+------------------------------------+ + |0.0 | 0xf90000 | + +------------------------------+------------------------------------+ + |-0.0 | 0xf98000 | + +------------------------------+------------------------------------+ + |1.0 | 0xf93c00 | + +------------------------------+------------------------------------+ + |1.1 | 0xfb3ff199999999999a | + +------------------------------+------------------------------------+ + |1.5 | 0xf93e00 | + +------------------------------+------------------------------------+ + |65504.0 | 0xf97bff | + +------------------------------+------------------------------------+ + |100000.0 | 0xfa47c35000 | + +------------------------------+------------------------------------+ + |3.4028234663852886e+38 | 0xfa7f7fffff | + +------------------------------+------------------------------------+ + |1.0e+300 | 0xfb7e37e43c8800759c | + +------------------------------+------------------------------------+ + |5.960464477539063e-8 | 0xf90001 | + +------------------------------+------------------------------------+ + |0.00006103515625 | 0xf90400 | + +------------------------------+------------------------------------+ + |-4.0 | 0xf9c400 | + +------------------------------+------------------------------------+ + |-4.1 | 0xfbc010666666666666 | + +------------------------------+------------------------------------+ + |Infinity | 0xf97c00 | + +------------------------------+------------------------------------+ + |NaN | 0xf97e00 | + +------------------------------+------------------------------------+ + |-Infinity | 0xf9fc00 | + +------------------------------+------------------------------------+ + |Infinity | 0xfa7f800000 | + +------------------------------+------------------------------------+ + |NaN | 0xfa7fc00000 | + +------------------------------+------------------------------------+ + |-Infinity | 0xfaff800000 | + +------------------------------+------------------------------------+ + |Infinity | 0xfb7ff0000000000000 | + +------------------------------+------------------------------------+ + |NaN | 0xfb7ff8000000000000 | + +------------------------------+------------------------------------+ + |-Infinity | 0xfbfff0000000000000 | + +------------------------------+------------------------------------+ + |false | 0xf4 | + +------------------------------+------------------------------------+ + |true | 0xf5 | + +------------------------------+------------------------------------+ + |null | 0xf6 | + +------------------------------+------------------------------------+ + |undefined | 0xf7 | + +------------------------------+------------------------------------+ + |simple(16) | 0xf0 | + +------------------------------+------------------------------------+ + |simple(255) | 0xf8ff | + +------------------------------+------------------------------------+ + |0("2013-03-21T20:04:00Z") | 0xc074323031332d30332d32315432303a | + | | 30343a30305a | + +------------------------------+------------------------------------+ + |1(1363896240) | 0xc11a514b67b0 | + +------------------------------+------------------------------------+ + |1(1363896240.5) | 0xc1fb41d452d9ec200000 | + +------------------------------+------------------------------------+ + |23(h'01020304') | 0xd74401020304 | + +------------------------------+------------------------------------+ + |24(h'6449455446') | 0xd818456449455446 | + +------------------------------+------------------------------------+ + |32("http://www.example.com") | 0xd82076687474703a2f2f7777772e6578 | + | | 616d706c652e636f6d | + +------------------------------+------------------------------------+ + |h'' | 0x40 | + +------------------------------+------------------------------------+ + |h'01020304' | 0x4401020304 | + +------------------------------+------------------------------------+ + |"" | 0x60 | + +------------------------------+------------------------------------+ + |"a" | 0x6161 | + +------------------------------+------------------------------------+ + |"IETF" | 0x6449455446 | + +------------------------------+------------------------------------+ + |"\"\\" | 0x62225c | + +------------------------------+------------------------------------+ + |"\u00fc" | 0x62c3bc | + +------------------------------+------------------------------------+ + |"\u6c34" | 0x63e6b0b4 | + +------------------------------+------------------------------------+ + |"\ud800\udd51" | 0x64f0908591 | + +------------------------------+------------------------------------+ + |[] | 0x80 | + +------------------------------+------------------------------------+ + |[1, 2, 3] | 0x83010203 | + +------------------------------+------------------------------------+ + |[1, [2, 3], [4, 5]] | 0x8301820203820405 | + +------------------------------+------------------------------------+ + |[1, 2, 3, 4, 5, 6, 7, 8, 9, | 0x98190102030405060708090a0b0c0d0e | + |10, 11, 12, 13, 14, 15, 16, | 0f101112131415161718181819 | + |17, 18, 19, 20, 21, 22, 23, | | + |24, 25] | | + +------------------------------+------------------------------------+ + |{} | 0xa0 | + +------------------------------+------------------------------------+ + |{1: 2, 3: 4} | 0xa201020304 | + +------------------------------+------------------------------------+ + |{"a": 1, "b": [2, 3]} | 0xa26161016162820203 | + +------------------------------+------------------------------------+ + |["a", {"b": "c"}] | 0x826161a161626163 | + +------------------------------+------------------------------------+ + |{"a": "A", "b": "B", "c": "C",| 0xa5616161416162614261636143616461 | + |"d": "D", "e": "E"} | 4461656145 | + +------------------------------+------------------------------------+ + |(_ h'0102', h'030405') | 0x5f42010243030405ff | + +------------------------------+------------------------------------+ + |(_ "strea", "ming") | 0x7f657374726561646d696e67ff | + +------------------------------+------------------------------------+ + |[_ ] | 0x9fff | + +------------------------------+------------------------------------+ + |[_ 1, [2, 3], [_ 4, 5]] | 0x9f018202039f0405ffff | + +------------------------------+------------------------------------+ + |[_ 1, [2, 3], [4, 5]] | 0x9f01820203820405ff | + +------------------------------+------------------------------------+ + |[1, [2, 3], [_ 4, 5]] | 0x83018202039f0405ff | + +------------------------------+------------------------------------+ + |[1, [_ 2, 3], [4, 5]] | 0x83019f0203ff820405 | + +------------------------------+------------------------------------+ + |[_ 1, 2, 3, 4, 5, 6, 7, 8, 9, | 0x9f0102030405060708090a0b0c0d0e0f | + |10, 11, 12, 13, 14, 15, 16, | 101112131415161718181819ff | + |17, 18, 19, 20, 21, 22, 23, | | + |24, 25] | | + +------------------------------+------------------------------------+ + |{_ "a": 1, "b": [_ 2, 3]} | 0xbf61610161629f0203ffff | + +------------------------------+------------------------------------+ + |["a", {_ "b": "c"}] | 0x826161bf61626163ff | + +------------------------------+------------------------------------+ + |{_ "Fun": true, "Amt": -2} | 0xbf6346756ef563416d7421ff | + +------------------------------+------------------------------------+ + + Table 6: Examples of Encoded CBOR Data Items + +Appendix B. Jump Table for Initial Byte + + For brevity, this jump table does not show initial bytes that are + reserved for future extension. It also only shows a selection of the + initial bytes that can be used for optional features. (All unsigned + integers are in network byte order.) + + +============+================================================+ + | Byte | Structure/Semantics | + +============+================================================+ + | 0x00..0x17 | unsigned integer 0x00..0x17 (0..23) | + +------------+------------------------------------------------+ + | 0x18 | unsigned integer (one-byte uint8_t follows) | + +------------+------------------------------------------------+ + | 0x19 | unsigned integer (two-byte uint16_t follows) | + +------------+------------------------------------------------+ + | 0x1a | unsigned integer (four-byte uint32_t follows) | + +------------+------------------------------------------------+ + | 0x1b | unsigned integer (eight-byte uint64_t follows) | + +------------+------------------------------------------------+ + | 0x20..0x37 | negative integer -1-0x00..-1-0x17 (-1..-24) | + +------------+------------------------------------------------+ + | 0x38 | negative integer -1-n (one-byte uint8_t for n | + | | follows) | + +------------+------------------------------------------------+ + | 0x39 | negative integer -1-n (two-byte uint16_t for n | + | | follows) | + +------------+------------------------------------------------+ + | 0x3a | negative integer -1-n (four-byte uint32_t for | + | | n follows) | + +------------+------------------------------------------------+ + | 0x3b | negative integer -1-n (eight-byte uint64_t for | + | | n follows) | + +------------+------------------------------------------------+ + | 0x40..0x57 | byte string (0x00..0x17 bytes follow) | + +------------+------------------------------------------------+ + | 0x58 | byte string (one-byte uint8_t for n, and then | + | | n bytes follow) | + +------------+------------------------------------------------+ + | 0x59 | byte string (two-byte uint16_t for n, and then | + | | n bytes follow) | + +------------+------------------------------------------------+ + | 0x5a | byte string (four-byte uint32_t for n, and | + | | then n bytes follow) | + +------------+------------------------------------------------+ + | 0x5b | byte string (eight-byte uint64_t for n, and | + | | then n bytes follow) | + +------------+------------------------------------------------+ + | 0x5f | byte string, byte strings follow, terminated | + | | by "break" | + +------------+------------------------------------------------+ + | 0x60..0x77 | UTF-8 string (0x00..0x17 bytes follow) | + +------------+------------------------------------------------+ + | 0x78 | UTF-8 string (one-byte uint8_t for n, and then | + | | n bytes follow) | + +------------+------------------------------------------------+ + | 0x79 | UTF-8 string (two-byte uint16_t for n, and | + | | then n bytes follow) | + +------------+------------------------------------------------+ + | 0x7a | UTF-8 string (four-byte uint32_t for n, and | + | | then n bytes follow) | + +------------+------------------------------------------------+ + | 0x7b | UTF-8 string (eight-byte uint64_t for n, and | + | | then n bytes follow) | + +------------+------------------------------------------------+ + | 0x7f | UTF-8 string, UTF-8 strings follow, terminated | + | | by "break" | + +------------+------------------------------------------------+ + | 0x80..0x97 | array (0x00..0x17 data items follow) | + +------------+------------------------------------------------+ + | 0x98 | array (one-byte uint8_t for n, and then n data | + | | items follow) | + +------------+------------------------------------------------+ + | 0x99 | array (two-byte uint16_t for n, and then n | + | | data items follow) | + +------------+------------------------------------------------+ + | 0x9a | array (four-byte uint32_t for n, and then n | + | | data items follow) | + +------------+------------------------------------------------+ + | 0x9b | array (eight-byte uint64_t for n, and then n | + | | data items follow) | + +------------+------------------------------------------------+ + | 0x9f | array, data items follow, terminated by | + | | "break" | + +------------+------------------------------------------------+ + | 0xa0..0xb7 | map (0x00..0x17 pairs of data items follow) | + +------------+------------------------------------------------+ + | 0xb8 | map (one-byte uint8_t for n, and then n pairs | + | | of data items follow) | + +------------+------------------------------------------------+ + | 0xb9 | map (two-byte uint16_t for n, and then n pairs | + | | of data items follow) | + +------------+------------------------------------------------+ + | 0xba | map (four-byte uint32_t for n, and then n | + | | pairs of data items follow) | + +------------+------------------------------------------------+ + | 0xbb | map (eight-byte uint64_t for n, and then n | + | | pairs of data items follow) | + +------------+------------------------------------------------+ + | 0xbf | map, pairs of data items follow, terminated by | + | | "break" | + +------------+------------------------------------------------+ + | 0xc0 | text-based date/time (data item follows; see | + | | Section 3.4.1) | + +------------+------------------------------------------------+ + | 0xc1 | epoch-based date/time (data item follows; see | + | | Section 3.4.2) | + +------------+------------------------------------------------+ + | 0xc2 | unsigned bignum (data item "byte string" | + | | follows) | + +------------+------------------------------------------------+ + | 0xc3 | negative bignum (data item "byte string" | + | | follows) | + +------------+------------------------------------------------+ + | 0xc4 | decimal Fraction (data item "array" follows; | + | | see Section 3.4.4) | + +------------+------------------------------------------------+ + | 0xc5 | bigfloat (data item "array" follows; see | + | | Section 3.4.4) | + +------------+------------------------------------------------+ + | 0xc6..0xd4 | (tag) | + +------------+------------------------------------------------+ + | 0xd5..0xd7 | expected conversion (data item follows; see | + | | Section 3.4.5.2) | + +------------+------------------------------------------------+ + | 0xd8..0xdb | (more tags; 1/2/4/8 bytes of tag number and | + | | then a data item follow) | + +------------+------------------------------------------------+ + | 0xe0..0xf3 | (simple value) | + +------------+------------------------------------------------+ + | 0xf4 | false | + +------------+------------------------------------------------+ + | 0xf5 | true | + +------------+------------------------------------------------+ + | 0xf6 | null | + +------------+------------------------------------------------+ + | 0xf7 | undefined | + +------------+------------------------------------------------+ + | 0xf8 | (simple value, one byte follows) | + +------------+------------------------------------------------+ + | 0xf9 | half-precision float (two-byte IEEE 754) | + +------------+------------------------------------------------+ + | 0xfa | single-precision float (four-byte IEEE 754) | + +------------+------------------------------------------------+ + | 0xfb | double-precision float (eight-byte IEEE 754) | + +------------+------------------------------------------------+ + | 0xff | "break" stop code | + +------------+------------------------------------------------+ + + Table 7: Jump Table for Initial Byte + +Appendix C. Pseudocode + + The well-formedness of a CBOR item can be checked by the pseudocode + in Figure 1. The data is well-formed if and only if: + + * the pseudocode does not "fail"; + + * after execution of the pseudocode, no bytes are left in the input + (except in streaming applications). + + The pseudocode has the following prerequisites: + + * take(n) reads n bytes from the input data and returns them as a + byte string. If n bytes are no longer available, take(n) fails. + + * uint() converts a byte string into an unsigned integer by + interpreting the byte string in network byte order. + + * Arithmetic works as in C. + + * All variables are unsigned integers of sufficient range. + + Note that "well_formed" returns the major type for well-formed + definite-length items, but 99 for an indefinite-length item (or -1 + for a "break" stop code, only if "breakable" is set). This is used + in "well_formed_indefinite" to ascertain that indefinite-length + strings only contain definite-length strings as chunks. + + well_formed(breakable = false) { + // process initial bytes + ib = uint(take(1)); + mt = ib >> 5; + val = ai = ib & 0x1f; + switch (ai) { + case 24: val = uint(take(1)); break; + case 25: val = uint(take(2)); break; + case 26: val = uint(take(4)); break; + case 27: val = uint(take(8)); break; + case 28: case 29: case 30: fail(); + case 31: + return well_formed_indefinite(mt, breakable); + } + // process content + switch (mt) { + // case 0, 1, 7 do not have content; just use val + case 2: case 3: take(val); break; // bytes/UTF-8 + case 4: for (i = 0; i < val; i++) well_formed(); break; + case 5: for (i = 0; i < val*2; i++) well_formed(); break; + case 6: well_formed(); break; // 1 embedded data item + case 7: if (ai == 24 && val < 32) fail(); // bad simple + } + return mt; // definite-length data item + } + + well_formed_indefinite(mt, breakable) { + switch (mt) { + case 2: case 3: + while ((it = well_formed(true)) != -1) + if (it != mt) // need definite-length chunk + fail(); // of same type + break; + case 4: while (well_formed(true) != -1); break; + case 5: while (well_formed(true) != -1) well_formed(); break; + case 7: + if (breakable) + return -1; // signal break out + else fail(); // no enclosing indefinite + default: fail(); // wrong mt + } + return 99; // indefinite-length data item + } + + Figure 1: Pseudocode for Well-Formedness Check + + Note that the remaining complexity of a complete CBOR decoder is + about presenting data that has been decoded to the application in an + appropriate form. + + Major types 0 and 1 are designed in such a way that they can be + encoded in C from a signed integer without actually doing an if-then- + else for positive/negative (Figure 2). This uses the fact that + (-1-n), the transformation for major type 1, is the same as ~n + (bitwise complement) in C unsigned arithmetic; ~n can then be + expressed as (-1)^n for the negative case, while 0^n leaves n + unchanged for nonnegative. The sign of a number can be converted to + -1 for negative and 0 for nonnegative (0 or positive) by arithmetic- + shifting the number by one bit less than the bit length of the number + (for example, by 63 for 64-bit numbers). + + void encode_sint(int64_t n) { + uint64t ui = n >> 63; // extend sign to whole length + unsigned mt = ui & 0x20; // extract (shifted) major type + ui ^= n; // complement negatives + if (ui < 24) + *p++ = mt + ui; + else if (ui < 256) { + *p++ = mt + 24; + *p++ = ui; + } else + ... + + Figure 2: Pseudocode for Encoding a Signed Integer + + See Section 1.2 for some specific assumptions about the profile of + the C language used in these pieces of code. + +Appendix D. Half-Precision + + As half-precision floating-point numbers were only added to IEEE 754 + in 2008 [IEEE754], today's programming platforms often still only + have limited support for them. It is very easy to include at least + decoding support for them even without such support. An example of a + small decoder for half-precision floating-point numbers in the C + language is shown in Figure 3. A similar program for Python is in + Figure 4; this code assumes that the 2-byte value has already been + decoded as an (unsigned short) integer in network byte order (as + would be done by the pseudocode in Appendix C). + + #include + + double decode_half(unsigned char *halfp) { + unsigned half = (halfp[0] << 8) + halfp[1]; + unsigned exp = (half >> 10) & 0x1f; + unsigned mant = half & 0x3ff; + double val; + if (exp == 0) val = ldexp(mant, -24); + else if (exp != 31) val = ldexp(mant + 1024, exp - 25); + else val = mant == 0 ? INFINITY : NAN; + return half & 0x8000 ? -val : val; + } + + Figure 3: C Code for a Half-Precision Decoder + + import struct + from math import ldexp + + def decode_single(single): + return struct.unpack("!f", struct.pack("!I", single))[0] + + def decode_half(half): + valu = (half & 0x7fff) << 13 | (half & 0x8000) << 16 + if ((half & 0x7c00) != 0x7c00): + return ldexp(decode_single(valu), 112) + return decode_single(valu | 0x7f800000) + + Figure 4: Python Code for a Half-Precision Decoder + +Appendix E. Comparison of Other Binary Formats to CBOR's Design + Objectives + + The proposal for CBOR follows a history of binary formats that is as + long as the history of computers themselves. Different formats have + had different objectives. In most cases, the objectives of the + format were never stated, although they can sometimes be implied by + the context where the format was first used. Some formats were meant + to be universally usable, although history has proven that no binary + format meets the needs of all protocols and applications. + + CBOR differs from many of these formats due to it starting with a set + of objectives and attempting to meet just those. This section + compares a few of the dozens of formats with CBOR's objectives in + order to help the reader decide if they want to use CBOR or a + different format for a particular protocol or application. + + Note that the discussion here is not meant to be a criticism of any + format: to the best of our knowledge, no format before CBOR was meant + to cover CBOR's objectives in the priority we have assigned them. A + brief recap of the objectives from Section 1.1 is: + + 1. unambiguous encoding of most common data formats from Internet + standards + + 2. code compactness for encoder or decoder + + 3. no schema description needed + + 4. reasonably compact serialization + + 5. applicability to constrained and unconstrained applications + + 6. good JSON conversion + + 7. extensibility + + A discussion of CBOR and other formats with respect to a different + set of design objectives is provided in Section 5 and Appendix C of + [RFC8618]. + +E.1. ASN.1 DER, BER, and PER + + [ASN.1] has many serializations. In the IETF, DER and BER are the + most common. The serialized output is not particularly compact for + many items, and the code needed to decode numeric items can be + complex on a constrained device. + + Few (if any) IETF protocols have adopted one of the several variants + of Packed Encoding Rules (PER). There could be many reasons for + this, but one that is commonly stated is that PER makes use of the + schema even for parsing the surface structure of the data item, + requiring significant tool support. There are different versions of + the ASN.1 schema language in use, which has also hampered adoption. + +E.2. MessagePack + + [MessagePack] is a concise, widely implemented counted binary + serialization format, similar in many properties to CBOR, although + somewhat less regular. While the data model can be used to represent + JSON data, MessagePack has also been used in many remote procedure + call (RPC) applications and for long-term storage of data. + + MessagePack has been essentially stable since it was first published + around 2011; it has not yet had a transition. The evolution of + MessagePack is impeded by an imperative to maintain complete + backwards compatibility with existing stored data, while only few + bytecodes are still available for extension. Repeated requests over + the years from the MessagePack user community to separate out binary + and text strings in the encoding recently have led to an extension + proposal that would leave MessagePack's "raw" data ambiguous between + its usages for binary and text data. The extension mechanism for + MessagePack remains unclear. + +E.3. BSON + + [BSON] is a data format that was developed for the storage of JSON- + like maps (JSON objects) in the MongoDB database. Its major + distinguishing feature is the capability for in-place update, which + prevents a compact representation. BSON uses a counted + representation except for map keys, which are null-byte terminated. + While BSON can be used for the representation of JSON-like objects on + the wire, its specification is dominated by the requirements of the + database application and has become somewhat baroque. The status of + how BSON extensions will be implemented remains unclear. + +E.4. MSDTP: RFC 713 + + Message Services Data Transmission (MSDTP) is a very early example of + a compact message format; it is described in [RFC0713], written in + 1976. It is included here for its historical value, not because it + was ever widely used. + +E.5. Conciseness on the Wire + + While CBOR's design objective of code compactness for encoders and + decoders is a higher priority than its objective of conciseness on + the wire, many people focus on the wire size. Table 8 shows some + encoding examples for the simple nested array [1, [2, 3]]; where some + form of indefinite-length encoding is supported by the encoding, + [_ 1, [2, 3]] (indefinite length on the outer array) is also shown. + + +=============+============================+================+ + | Format | [1, [2, 3]] | [_ 1, [2, 3]] | + +=============+============================+================+ + | RFC 713 | c2 05 81 c2 02 82 83 | | + +-------------+----------------------------+----------------+ + | ASN.1 BER | 30 0b 02 01 01 30 06 02 01 | 30 80 02 01 01 | + | | 02 02 01 03 | 30 06 02 01 02 | + | | | 02 01 03 00 00 | + +-------------+----------------------------+----------------+ + | MessagePack | 92 01 92 02 03 | | + +-------------+----------------------------+----------------+ + | BSON | 22 00 00 00 10 30 00 01 00 | | + | | 00 00 04 31 00 13 00 00 00 | | + | | 10 30 00 02 00 00 00 10 31 | | + | | 00 03 00 00 00 00 00 | | + +-------------+----------------------------+----------------+ + | CBOR | 82 01 82 02 03 | 9f 01 82 02 03 | + | | | ff | + +-------------+----------------------------+----------------+ + + Table 8: Examples for Different Levels of Conciseness + +Appendix F. Well-Formedness Errors and Examples + + There are three basic kinds of well-formedness errors that can occur + in decoding a CBOR data item: + + Too much data: There are input bytes left that were not consumed. + This is only an error if the application assumed that the input + bytes would span exactly one data item. Where the application + uses the self-delimiting nature of CBOR encoding to permit + additional data after the data item, as is done in CBOR sequences + [RFC8742], for example, the CBOR decoder can simply indicate which + part of the input has not been consumed. + + Too little data: The input data available would need additional + bytes added at their end for a complete CBOR data item. This may + indicate the input is truncated; it is also a common error when + trying to decode random data as CBOR. For some applications, + however, this may not actually be an error, as the application may + not be certain it has all the data yet and can obtain or wait for + additional input bytes. Some of these applications may have an + upper limit for how much additional data can appear; here the + decoder may be able to indicate that the encoded CBOR data item + cannot be completed within this limit. + + Syntax error: The input data are not consistent with the + requirements of the CBOR encoding, and this cannot be remedied by + adding (or removing) data at the end. + + In Appendix C, errors of the first kind are addressed in the first + paragraph and bullet list (requiring "no bytes are left"), and errors + of the second kind are addressed in the second paragraph/bullet list + (failing "if n bytes are no longer available"). Errors of the third + kind are identified in the pseudocode by specific instances of + calling fail(), in order: + + * a reserved value is used for additional information (28, 29, 30) + + * major type 7, additional information 24, value < 32 (incorrect) + + * incorrect substructure of indefinite-length byte string or text + string (may only contain definite-length strings of the same major + type) + + * "break" stop code (major type 7, additional information 31) occurs + in a value position of a map or except at a position directly in + an indefinite-length item where also another enclosed data item + could occur + + * additional information 31 used with major type 0, 1, or 6 + +F.1. Examples of CBOR Data Items That Are Not Well-Formed + + This subsection shows a few examples for CBOR data items that are not + well-formed. Each example is a sequence of bytes, each shown in + hexadecimal; multiple examples in a list are separated by commas. + + Examples for well-formedness error kind 1 (too much data) can easily + be formed by adding data to a well-formed encoded CBOR data item. + + Similarly, examples for well-formedness error kind 2 (too little + data) can be formed by truncating a well-formed encoded CBOR data + item. In test suites, it may be beneficial to specifically test with + incomplete data items that would require large amounts of addition to + be completed (for instance by starting the encoding of a string of a + very large size). + + A premature end of the input can occur in a head or within the + enclosed data, which may be bare strings or enclosed data items that + are either counted or should have been ended by a "break" stop code. + + End of input in a head: 18, 19, 1a, 1b, 19 01, 1a 01 02, 1b 01 02 03 + 04 05 06 07, 38, 58, 78, 98, 9a 01 ff 00, b8, d8, f8, f9 00, fa 00 + 00, fb 00 00 00 + + Definite-length strings with short data: 41, 61, 5a ff ff ff ff 00, + 5b ff ff ff ff ff ff ff ff 01 02 03, 7a ff ff ff ff 00, 7b 7f ff + ff ff ff ff ff ff 01 02 03 + + Definite-length maps and arrays not closed with enough items: 81, 81 + 81 81 81 81 81 81 81 81, 82 00, a1, a2 01 02, a1 00, a2 00 00 00 + + Tag number not followed by tag content: c0 + + Indefinite-length strings not closed by a "break" stop code: 5f 41 + 00, 7f 61 00 + + Indefinite-length maps and arrays not closed by a "break" stop + code: 9f, 9f 01 02, bf, bf 01 02 01 02, 81 9f, 9f 80 00, 9f 9f 9f 9f + 9f ff ff ff ff, 9f 81 9f 81 9f 9f ff ff ff + + A few examples for the five subkinds of well-formedness error kind 3 + (syntax error) are shown below. + + Subkind 1: + Reserved additional information values: 1c, 1d, 1e, 3c, 3d, 3e, + 5c, 5d, 5e, 7c, 7d, 7e, 9c, 9d, 9e, bc, bd, be, dc, dd, de, fc, + fd, fe, + + Subkind 2: + Reserved two-byte encodings of simple values: f8 00, f8 01, f8 + 18, f8 1f + + Subkind 3: + Indefinite-length string chunks not of the correct type: 5f 00 + ff, 5f 21 ff, 5f 61 00 ff, 5f 80 ff, 5f a0 ff, 5f c0 00 ff, 5f + e0 ff, 7f 41 00 ff + + Indefinite-length string chunks not definite length: 5f 5f 41 00 + ff ff, 7f 7f 61 00 ff ff + + Subkind 4: + Break occurring on its own outside of an indefinite-length + item: ff + + Break occurring in a definite-length array or map or a tag: 81 + ff, 82 00 ff, a1 ff, a1 ff 00, a1 00 ff, a2 00 00 ff, 9f 81 ff, + 9f 82 9f 81 9f 9f ff ff ff ff + + Break in an indefinite-length map that would lead to an odd + number of items (break in a value position): bf 00 ff, bf 00 00 + 00 ff + + Subkind 5: + Major type 0, 1, 6 with additional information 31: 1f, 3f, df + +Appendix G. Changes from RFC 7049 + + As discussed in the introduction, this document formally obsoletes + RFC 7049 while keeping full compatibility with the interchange format + from RFC 7049. This document provides editorial improvements, added + detail, and fixed errata. This document does not create a new + version of the format. + +G.1. Errata Processing and Clerical Changes + + The two verified errata on RFC 7049, [Err3764] and [Err3770], + concerned two encoding examples in the text that have been corrected + (Section 3.4.3: "29" -> "49", Section 5.5: "0b000_11101" -> + "0b000_11001"). Also, RFC 7049 contained an example using the + numeric value 24 for a simple value [Err5917], which is not well- + formed; this example has been removed. Errata report 5763 [Err5763] + pointed to an error in the wording of the definition of tags; this + was resolved during a rewrite of Section 3.4. Errata report 5434 + [Err5434] pointed out that the Universal Binary JSON (UBJSON) example + in Appendix E no longer complied with the version of UBJSON current + at the time of the errata report submission. It turned out that the + UBJSON specification had completely changed since 2013; this example + therefore was removed. Other errata reports [Err4409] [Err4963] + [Err4964] complained that the map key sorting rules for canonical + encoding were onerous; these led to a reconsideration of the + canonical encoding suggestions and replacement by the deterministic + encoding suggestions (described below). An editorial suggestion in + errata report 4294 [Err4294] was also implemented (improved symmetry + by adding "Second value" to a comment to the last example in + Section 3.2.2). + + Other clerical changes include: + + * the use of new xml2rfc functionality [RFC7991]; + + * more explanation of the notation used; + + * the update of references, e.g., from RFC 4627 to [RFC8259], from + CNN-TERMS to [RFC7228], and from the 5.1 edition to the 11th + edition of [ECMA262]; the addition of a reference to [IEEE754] and + importation of required definitions; the addition of references to + [C] and [Cplusplus20]; and the addition of a reference to + [RFC8618] that further illustrates the discussion in Appendix E; + + * in the discussion of diagnostic notation (Section 8), the + "Extended Diagnostic Notation" (EDN) defined in [RFC8610] is now + mentioned, the gap in representing NaN payloads is now + highlighted, and an explanation of representing indefinite-length + strings with no chunks has been added (Section 8.1); + + * the addition of this appendix. + +G.2. Changes in IANA Considerations + + The IANA considerations were generally updated (clerical changes, + e.g., now pointing to the CBOR Working Group as the author of the + specification). References to the respective IANA registries were + added to the informative references. + + In the "Concise Binary Object Representation (CBOR) Tags" registry + [IANA.cbor-tags], tags in the space from 256 to 32767 (lower half of + "1+2") are no longer assigned by First Come First Served; this range + is now Specification Required. + +G.3. Changes in Suggestions and Other Informational Components + + While revising the document, beyond the addressing of the errata + reports, the working group drew upon nearly seven years of experience + with CBOR in a diverse set of applications. This led to a number of + editorial changes, including adding tables for illustration, but also + emphasizing some aspects and de-emphasizing others. + + A significant addition is Section 2, which discusses the CBOR data + model and its small variations involved in the processing of CBOR. + The introduction of terms for those variations (basic generic, + extended generic, specific) enables more concise language in other + places of the document and also helps to clarify expectations of + implementations and of the extensibility features of the format. + + As a format derived from the JSON ecosystem, RFC 7049 was influenced + by the JSON number system that was in turn inherited from JavaScript + at the time. JSON does not provide distinct integers and floating- + point values (and the latter are decimal in the format). CBOR + provides binary representations of numbers, which do differ between + integers and floating-point values. Experience from implementation + and use suggested that the separation between these two number + domains should be more clearly drawn in the document; language that + suggested an integer could seamlessly stand in for a floating-point + value was removed. Also, a suggestion (based on I-JSON [RFC7493]) + was added for handling these types when converting JSON to CBOR, and + the use of a specific rounding mechanism has been recommended. + + For a single value in the data model, CBOR often provides multiple + encoding options. A new section (Section 4) introduces the term + "preferred serialization" (Section 4.1) and defines it for various + kinds of data items. On the basis of this terminology, the section + then discusses how a CBOR-based protocol can define "deterministic + encoding" (Section 4.2), which avoids terms "canonical" and + "canonicalization" from RFC 7049. The suggestion of "Core + Deterministic Encoding Requirements" (Section 4.2.1) enables generic + support for such protocol-defined encoding requirements. This + document further eases the implementation of deterministic encoding + by simplifying the map ordering suggested in RFC 7049 to a simple + lexicographic ordering of encoded keys. A description of the older + suggestion is kept as an alternative, now termed "length-first map + key ordering" (Section 4.2.3). + + The terminology for well-formed and valid data was sharpened and more + stringently used, avoiding less well-defined alternative terms such + as "syntax error", "decoding error", and "strict mode" outside of + examples. Also, a third level of requirements that an application + has on its input data beyond CBOR-level validity is now explicitly + called out. Well-formed (processable at all), valid (checked by a + validity-checking generic decoder), and expected input (as checked by + the application) are treated as a hierarchy of layers of + acceptability. + + The handling of non-well-formed simple values was clarified in text + and pseudocode. Appendix F was added to discuss well-formedness + errors and provide examples for them. The pseudocode was updated to + be more portable, and some portability considerations were added. + + The discussion of validity has been sharpened in two areas. Map + validity (handling of duplicate keys) was clarified, and the domain + of applicability of certain implementation choices explained. Also, + while streamlining the terminology for tags, tag numbers, and tag + content, discussion was added on tag validity, and the restrictions + were clarified on tag content, in general and specifically for tag 1. + + An implementation note (and note for future tag definitions) was + added to Section 3.4 about defining tags with semantics that depend + on serialization order. + + Tag 35 is not defined by this document; the registration based on the + definition in RFC 7049 remains in place. + + Terminology was introduced in Section 3 for "argument" and "head", + simplifying further discussion. + + The security considerations (Section 10) were mostly rewritten and + significantly expanded; in multiple other places, the document is now + more explicit that a decoder cannot simply condone well-formedness + errors. + +Acknowledgements + + CBOR was inspired by MessagePack. MessagePack was developed and + promoted by Sadayuki Furuhashi ("frsyuki"). This reference to + MessagePack is solely for attribution; CBOR is not intended as a + version of, or replacement for, MessagePack, as it has different + design goals and requirements. + + The need for functionality beyond the original MessagePack + specification became obvious to many people at about the same time + around the year 2012. BinaryPack is a minor derivation of + MessagePack that was developed by Eric Zhang for the binaryjs + project. A similar, but different, extension was made by Tim Caswell + for his msgpack-js and msgpack-js-browser projects. Many people have + contributed to the discussion about extending MessagePack to separate + text string representation from byte string representation. + + The encoding of the additional information in CBOR was inspired by + the encoding of length information designed by Klaus Hartke for CoAP. + + This document also incorporates suggestions made by many people, + notably Dan Frost, James Manger, Jeffrey Yasskin, Joe Hildebrand, + Keith Moore, Laurence Lundblade, Matthew Lepinski, Michael + Richardson, Nico Williams, Peter Occil, Phillip Hallam-Baker, Ray + Polk, Stuart Cheshire, Tim Bray, Tony Finch, Tony Hansen, and Yaron + Sheffer. Benjamin Kaduk provided an extensive review during IESG + processing. Éric Vyncke, Erik Kline, Robert Wilton, and Roman Danyliw + provided further IESG comments, which included an IoT directorate + review by Eve Schooler. + +Authors' Addresses + + Carsten Bormann + Universität Bremen TZI + Postfach 330440 + D-28359 Bremen + Germany + + Phone: +49-421-218-63921 + Email: cabo@tzi.org + + + Paul Hoffman + ICANN + + Email: paul.hoffman@icann.org -- cgit v1.2.3