diff options
Diffstat (limited to 'doc/rfc/rfc2803.txt')
-rw-r--r-- | doc/rfc/rfc2803.txt | 619 |
1 files changed, 619 insertions, 0 deletions
diff --git a/doc/rfc/rfc2803.txt b/doc/rfc/rfc2803.txt new file mode 100644 index 0000000..3d1df4a --- /dev/null +++ b/doc/rfc/rfc2803.txt @@ -0,0 +1,619 @@ + + + + + + +Network Working Group H. Maruyama +Request for Comments: 2803 K. Tamura +Category: Informational N. Uramoto + IBM + April 2000 + + + Digest Values for DOM (DOMHASH) + +Status of this Memo + + This memo provides information for the Internet community. It does + not specify an Internet standard of any kind. Distribution of this + memo is unlimited. + +Copyright Notice + + Copyright (C) The Internet Society (2000). All Rights Reserved. + +Abstract + + This memo defines a clear and unambiguous definition of digest (hash) + values of the XML objects regardless of the surface string variation + of XML. This definition can be used for XML digital signature as well + efficient replication of XML objects. + +Table of Contents + + 1. Introduction............................................2 + 2. Digest Calculation......................................3 + 2.1. Overview..............................................3 + 2.2. Namespace Considerations..............................4 + 2.3. Definition with Code Fragments........................5 + 2.3.1. Text Nodes..........................................5 + 2.3.2. Processing Instruction Nodes........................6 + 2.3.3. Attr Nodes..........................................6 + 2.3.4. Element Nodes.......................................7 + 2.3.5. Document Nodes......................................9 + 3. Discussion..............................................9 + 4. Security Considerations.................................9 + References................................................10 + Authors' Addresses........................................10 + Full Copyright Statement..................................11 + + + + + + + + +Maruyama, et al. Informational [Page 1] + +RFC 2803 Digest Values for DOM (DOMHASH) April 2000 + + +1. Introduction + + The purpose of this document is to give a clear and unambiguous + definition of digest (hash) values of the XML objects [XML]. Two + subtrees are considered identical if their hash values are the same, + and different if their hash values are different. + + There are at least two usage scenarios of DOMHASH. One is as a basis + for digital signatures for XML. Digital signature algorithms normally + require hashing a signed content before signing. DOMHASH provides a + concrete definition of the hash value calculation. + + The other is to use DOMHASH when synchronizing two DOM structures + [DOM]. Suppose that a server program generates a DOM structure which + is to be rendered by clients. If the server makes frequent small + changes on a large DOM tree, it is desirable that only the modified + parts are sent over to the client. A client can initiate a request by + sending the root hash value of the structure in the cache memory. If + it matches with the root hash value of the current server structure, + nothing needs be sent. If not, then the server compares the client + hash with the older versions in the server's cache. If it finds one + that matches the client's version of the structure, then it locates + differences with the current version by recursively comparing the + hash values of each node. This way, the client can receive only an + updated portion of a large structure without requesting the whole + thing. + + One way of defining digest values is to take a surface string as the + input for a digest algorithm. However, this approach has several + drawbacks. The same internal DOM structure may be represented in may + different ways as surface strings even if they strictly conform to + the XML specification. Treatment of white spaces, selection of + character encodings, entity references (i.e., use of ampersands), and + so on have impact on the generation of a surface string. If the + implementations of surface string generation are different, the hash + values would be different, resulting in unvalidatable digital + signatures and unsuccessful detection of identical DOM structures. + Therefore, it is desirable that digest of DOM is defined in the DOM + terms -- that is, as an unambiguous algorithm operating on a DOM + tree. This is the approach we take in this specification. + + Introduction of namespace is another source of variation of surface + string because different namespace prefixes can be used for + representing the same namespace URI [URI]. In the following example, + the namespace prefix "edi" is bound to the URI + "http://ecommerce.org/schema" but this prefix can be arbitrary chosen + without changing the logical contents as shown in the second example. + + + + +Maruyama, et al. Informational [Page 2] + +RFC 2803 Digest Values for DOM (DOMHASH) April 2000 + + + <?xml version="1.0"?> + <root xmlns:edi='http://ecommerce.org/schema'> + <edi:order> + : + </edi:order> + </root> + + + <?xml version="1.0"?> + <root xmlns:ec='http://ecommerce.org/schema'> + <ec:order> + : + </ec:order> + </root> + + The DOMHASH defined in this document is designed so that the choice + of the namespace prefix does not affect the digest value. In the + above example, both the "root" elements will get the same digest + value. + +2. Digest Calculation + +2.1. Overview + + Hash values are defined on the DOM type Node. We consider the + following five node types that are used for representing a DOM + document structure: + + - Text + - ProcessingInstruction + - Attr + - Element + - Document + + Comment nodes and Document Type Definitions (DTDs) do not participate + in the digest value calculation. This is because DOM does not + require a conformant processor to create data structures for these. + DOMHASH is designed so that it can be computed with any XML processor + conformant to the DOM or SAX [SAX] specification. + + Nodes with the node type EntityReference must be expanded prior to + digest calculation. + + The digest values are defined recursively on each level of the DOM + tree so that only a relevant part needs to be recalculated when a + small portion of the tree is changed. + + + + + +Maruyama, et al. Informational [Page 3] + +RFC 2803 Digest Values for DOM (DOMHASH) April 2000 + + + Below, we give the precise definitions of digest for these types. We + describe the format of the data to be supplied to a hash algorithm + using a figure and a simple description, followed by a Java code + fragment using the DOM API and the JDK 1.1 Platform Core API only. + Therefore, the semantics should be unambiguous. + + As the rule of thumb, all strings are to be in UTF-16BE [UTF16]. If + there is a sequence of text nodes without any element nodes in + between, these text nodes are merged into one by concatenating them. + A zero-length text node is always ignored. + + Note that validating and non-validating XML processors may generate + different DOM trees from the same XML document, due to attribute + normalization and default attributes. If DOMHASH is to be used for + testing logical equivalence between two XML documents (as opposed to + DOM trees), it may be necessary to normalize attributes and supply + default attributes prior to DOMHASH calculation. + + Some legacy character encodings (such as ISO-2022-JP) have certain + ambiguity in translating into Unicode. This is again dependent on + XML processors. Treatment of such processor dependencies is out of + scope of this document. + +2.2. Namespace Considerations + + To avoid the dependence on the namespace prefix, we use "expanded + names" to do digest calculation. If an element name or an attribute + name is qualified either by a explicit namespace prefix or by a + default namespace, the name's LocalPart is prepended by the URI of + the namespace (the namespace name as defined in the Namespace + specification [NAM]) and a colon before digest calculation. In the + following example, the default qualified name "order" is expanded + into "http://ecommerce.org/schema:order" while the explicit qualified + name "book:title" is expanded into "urn:loc.gov:books:title" before + digest calculation. + + <?xml version="1.0"?> + + <root xmlns='http://ecommerce.org/schema' + xmlns:book='urn:loc.gov:books'> + <order> + <book:title> ... </book:title> + : + </order> + </root> + + + + + + +Maruyama, et al. Informational [Page 4] + +RFC 2803 Digest Values for DOM (DOMHASH) April 2000 + + + We define an expanded name (either for element or attribute) as + follows: + + If a name is not qualified, the expanded name is the name itself. + + If a name is qualified with the prefix "xmlns", the expanded name + is undefined. + + If a name is qualified either by default or by an explicit + namespace prefix, the expanded name is URI bound to the namespace + + ":" + LocalPart + + In the following example code, we assume that the getExpandedName() + method (which returns the expanded name as defined above) is defined + in both Element and Attr interfaces of DOM. + + Note that the digest values are not defined on namespace + declarations. In other words, the digest value is not defined for an + attribute when + + - the attribute name is "xmlns", or + - the namespace prefix is "xmlns". + + In the above example, the two attributes which are namespace + declarations do not have digest values and therefore will not + participate in the calculation of the digest value of the "root" + element. + +2.3. Definition with Code Fragments + + The code fragments in the definitions below assume that they are in + implementation classes of Node. Therefore, a methods call without an + explicit object reference is for the Node itself. For example, + getData() returns the text data of the current node if it is a Text + node. The parameter digestAlgorithm is to be replaced by an + identifier of the digest algorithm, such as "MD5" [MD5] and "SHA-1" + [SHA]. + + The computation should begin with a four byte integer that represents + the type of the node, such as TEXT_NODE or ELEMENT_NODE. + +2.3.1. Text Nodes + + The hash value of a Text node is computed on the four byte header + followed by the UTF-16BE encoded text string. + + - TEXT_NODE (3) in 32 bit network-byte-ordered integer + - Text data in UTF-16BE stream (variable length) + + + +Maruyama, et al. Informational [Page 5] + +RFC 2803 Digest Values for DOM (DOMHASH) April 2000 + + + public byte[] getDigest(String digestAlgorithm) { + MessageDigest md = MessageDigest.getInstance(digestAlgorithm); + md.update((byte)0); + md.update((byte)0); + md.update((byte)0); + md.update((byte)3); + md.update(getData().getBytes("UnicodeBigUnmarked")); + return md.digest(); + } + + Here, MessageDigest is in the package java.security.*, one of the + built-in packages of JDK 1.1. + +2.3.2. ProcessingInstruction Nodes + + A ProcessingInstruction (PI) node has two components: the target and + the data. Accordingly, the hash is computed on the concatenation of + both, separated by 'x0000'. PI data is from the first non white + space character after the target to the character immediately + preceding the "?>". + + - PROCESSING_INSTRUCTION_NODE (7) in 32 bit network-byte-ordered + integer + - PI target in UTF-16BE stream (variable length) + - 0x00 0x00 + - PI data in UTF-16BE stream (variable length) + + public byte[] getDigest(String digestAlgorithm) { + MessageDigest md = MessageDigest.getInstance(digestAlgorithm); + md.update((byte)0); + md.update((byte)0); + md.update((byte)0); + md.update((byte)7); + md.update(getName().getBytes("UnicodeBigUnmarked")); + md.update((byte)0); + md.update((byte)0); + md.update(getData().getBytes("UnicodeBigUnmarked")); + return md.digest(); + } + +2.3.3. Attr Nodes + + The digest value of Attr nodes are defined similarly to PI nodes, + except that we need a separator between the expanded attribute name + and the attribute value. The '0x0000' value in UTF-16BE is allowed + nowhere in an XML document, so it can serve as an unambiguous + separator. The expanded name must be used as the attribute name + because it may be qualified. Note that if the attribute is a + + + +Maruyama, et al. Informational [Page 6] + +RFC 2803 Digest Values for DOM (DOMHASH) April 2000 + + + namespace declaration (either the attribute name is "xmlns" or its + prefix is "xmlns"), the digest value is undefined and the getDigest() + method should return null. + + - ATTRIBUTE_NODE (2) in 32 bit network-byte-ordered integer + - Expanded attribute name in UTF-16BE stream (variable length) + - 0x00 0x00 + - Attribute value in UTF-16BE stream (variable length) + + public byte[] getDigest(String digestAlgorithm) { + if (getNodeName().equals("xmlns") + || getNodeName().startsWith("xmlns:")) + return null; + MessageDigest md = MessageDigest.getInstance(digestAlgorithm); + md.update((byte)0); + md.update((byte)0); + md.update((byte)0); + md.update((byte)2); + md.update(getExpandedName().getBytes("UnicodeBigUnmarked")); + md.update((byte)0); + md.update((byte)0); + md.update(getValue().getBytes("UnicodeBigUnmarked")); + return md.digest(); + } + +2.3.4. Element Nodes + + Element nodes are the most complex because they consist of other + nodes recursively. Hash values of these component nodes are used to + calculate the node's digest so that we can save computation when the + structure is partially changed. + + First, all the attributes except for namespace declarations must be + collected. This list is sorted lexicographically by the expanded + attribute names (based on Unicode character code points). When no + surrogate characters are involved, this is the same as sorting in + ascending order in terms of the UTF-16BE encoded expanded attribute + names, using the string comparison operator String.compareTo() in + Java. + + - ELEMENT_NODE (1) in 32 bit network-byte-ordered integer + - Expanded element name in UTF-16BE stream (variable length) + - 0x00 0x00 + - A number of non-namespace-declaration attributes in 32 bit + network-byte-ordered unsigned integer + - Sequence of digest values of non-namespace-declaration attributes, + sorted lexicographically by expanded attribute names + - A number of child nodes (except for Comment nodes) in 32bit + + + +Maruyama, et al. Informational [Page 7] + +RFC 2803 Digest Values for DOM (DOMHASH) April 2000 + + + network-byte-ordered unsigned integer + - Sequence of digest values of each child node except for Comment + nodes (variable length) (A sequence of child texts is merged to one + text. A zero-length text and Comment nodes are not counted as + child) + + public byte[] getDigest(String digestAlgorithm) { + MessageDigest md = MessageDigest.getInstance(digestAlgorithm); + ByteArrayOutputStream baos = new ByteArrayOutputStream(); + DataOutputStream dos = new DataOutputStream(baos); + dos.writeInt(ELEMENT_NODE);//This is stored in network byte order + dos.write(getExpandedName().getBytes("UnicodeBigUnmarked")); + dos.write((byte)0); + dos.write((byte)0); + // Collect all attributes except for namespace declarations + NamedNodeMap nnm = this.getAttributes(); + int len = nnm.getLength() + // Find "xmlns" or "xmlns:foo" in nnm and omit it. + ... + dos.writeInt(len); // This is sorted in the network byte order + // Sort attributes lexicographically by expanded attribute + // names. + ... + // Assume that `Attr[] aattr' has sorted Attribute instances. + for (int i = 0; i < len; i ++) + dos.write(aattr[i].getDigest(digestAlgorithm)); + Node n = this.getFirstChild(); + // Assume that adjoining Texts are merged, + // there is no 0-length Text, and + // comment nodes are removed. + len = this.getChildNodes().getLength(); + dos.writeInt(len); // This is stored in the network byte order + while (n != null) { + dos.write(n.getDigest(digestAlgorithm)); + n = n.getNextSibling(); + } + dos.close(); + md.update(baos.toByteArray()); + return md.digest(); + } + + + + + + + + + + + +Maruyama, et al. Informational [Page 8] + +RFC 2803 Digest Values for DOM (DOMHASH) April 2000 + + +2.3.5. Document Nodes + + A Document node may have PI nodes before and after the root Element + node. The digest value of a Document node is computed based on the + sequence of the digest values of the pre-root PI nodes, the root + Element node, and the post-root PI nodes in this order. Comment + nodes and DocumentType nodes, if any, are ignored. + + - DOCUMENT_NODE (9) in 32 bit network-byte-ordered integer + - A number of child nodes (except for Comment and DocumentType nodes) + in 32bit network-byte-ordered unsigned integer + - Sequence of digest values of each child node except for Comment and + DocumentType nodes (variable length) + + public byte[] getDigest(String digestAlgorithm) { + MessageDigest md = MessageDigest.getInstance(digestAlgorithm); + ByteArrayOutputStream baos = new ByteArrayOutputStream(); + DataOutputStream dos = new DataOutputStream(baos); + dos.writeInt(DOCUMENT_NODE);//This is stored in network byte order + + // Assume that Comment and DocumentType nodes are removed and this + // node has only an Element node and PI nodes. + len = this.getChildNodes().getLength(); + dos.writeInt(len); // This is stored in the network byte order + Node n = this.getFirstChild(); + while (n != null) { + dos.write(n.getDigest(digestAlgorithm)); + n = n.getNextSibling(); + } + dos.close(); + md.update(baos.toByteArray()); + return md.digest(); + } + +3. Discussion + + The definition described above can be efficiently implemented with + any XML processor that is conformant to either DOM and SAX + specification. Reference implementations are available on request. + +4. Security Considerations + + DOMHASH is expected to be used as the basis for digital signatures + and other security and integrity uses. It's appropriateness for + such uses depends on the security of the hash algorithm used and + inclusion of the fundamental characteristics it is desired to check + in parts of the DOM model incorporated in the digest by DOMHASH. + + + + +Maruyama, et al. Informational [Page 9] + +RFC 2803 Digest Values for DOM (DOMHASH) April 2000 + + +References + + [DOM] "Document Object Model (DOM), Level 1 Specification", October + 1998, http://www.w3.org/TR/REC-DOM-Level-1/ + + [MD5] Rivest, R., "The MD5 Message-Digest Algorithm", RFC 1321, + April 1992. + + [NAM] Tim Bray, Dave Hollander, Andrew Layman, "Namespaces in XML", + http://www.w3.org/TR/1999/REC-xml-names-19990114. + + [SAX] David Megginson, "SAX 1.0: The Simple API for XML", + http://www.megginson.com/SAX/, May 1998. + + [SHA] (US) National Institute of Standards and Technology, "Federal + Information Processing Standards Publication 180-1: Secure Hash + Standard", 17 April 1995. + + [URI] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform + Resource Identifiers (URI): Generic Syntax", RFC 2396, August + 1998. + + [UTF16] Hoffman, P., Yergeau, F., "UTF-16, an encoding of ISO 10646", + RFC 2781, February 2000. + + [XML] Tim Bray, Jean Paoli, C. M. Sperber-McQueen, "Extensible + Markup Language (XML) 1.0", http://www.w3.org/TR/1998/REC-xml- + 19980210 + +Authors' Addresses + + Hiroshi Maruyama, + IBM Research, Tokyo Research Laboratory + + EMail: maruyama@jp.ibm.com + + + Kent Tamura, + IBM Research, Tokyo Research Laboratory + + EMail: kent@trl.ibm.co.jp + + + Naohiko Uramoto, + IBM Research, Tokyo Research Laboratory + + EMail: uramoto@jp.ibm.com + + + + +Maruyama, et al. Informational [Page 10] + +RFC 2803 Digest Values for DOM (DOMHASH) April 2000 + + +Full Copyright Statement + + Copyright (C) The Internet Society (2000). All Rights Reserved. + + This document and translations of it may be copied and furnished to + others, and derivative works that comment on or otherwise explain it + or assist in its implementation may be prepared, copied, published + and distributed, in whole or in part, without restriction of any + kind, provided that the above copyright notice and this paragraph are + included on all such copies and derivative works. However, this + document itself may not be modified in any way, such as by removing + the copyright notice or references to the Internet Society or other + Internet organizations, except as needed for the purpose of + developing Internet standards in which case the procedures for + copyrights defined in the Internet Standards process must be + followed, or as required to translate it into languages other than + English. + + The limited permissions granted above are perpetual and will not be + revoked by the Internet Society or its successors or assigns. + + This document and the information contained herein is provided on an + "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING + TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING + BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION + HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF + MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. + +Acknowledgment + + Funding for the RFC Editor function is currently provided by the + Internet Society. + + + + + + + + + + + + + + + + + + + +Maruyama, et al. Informational [Page 11] + |