summaryrefslogtreecommitdiff
path: root/doc/rfc/rfc2803.txt
diff options
context:
space:
mode:
Diffstat (limited to 'doc/rfc/rfc2803.txt')
-rw-r--r--doc/rfc/rfc2803.txt619
1 files changed, 619 insertions, 0 deletions
diff --git a/doc/rfc/rfc2803.txt b/doc/rfc/rfc2803.txt
new file mode 100644
index 0000000..3d1df4a
--- /dev/null
+++ b/doc/rfc/rfc2803.txt
@@ -0,0 +1,619 @@
+
+
+
+
+
+
+Network Working Group H. Maruyama
+Request for Comments: 2803 K. Tamura
+Category: Informational N. Uramoto
+ IBM
+ April 2000
+
+
+ Digest Values for DOM (DOMHASH)
+
+Status of this Memo
+
+ This memo provides information for the Internet community. It does
+ not specify an Internet standard of any kind. Distribution of this
+ memo is unlimited.
+
+Copyright Notice
+
+ Copyright (C) The Internet Society (2000). All Rights Reserved.
+
+Abstract
+
+ This memo defines a clear and unambiguous definition of digest (hash)
+ values of the XML objects regardless of the surface string variation
+ of XML. This definition can be used for XML digital signature as well
+ efficient replication of XML objects.
+
+Table of Contents
+
+ 1. Introduction............................................2
+ 2. Digest Calculation......................................3
+ 2.1. Overview..............................................3
+ 2.2. Namespace Considerations..............................4
+ 2.3. Definition with Code Fragments........................5
+ 2.3.1. Text Nodes..........................................5
+ 2.3.2. Processing Instruction Nodes........................6
+ 2.3.3. Attr Nodes..........................................6
+ 2.3.4. Element Nodes.......................................7
+ 2.3.5. Document Nodes......................................9
+ 3. Discussion..............................................9
+ 4. Security Considerations.................................9
+ References................................................10
+ Authors' Addresses........................................10
+ Full Copyright Statement..................................11
+
+
+
+
+
+
+
+
+Maruyama, et al. Informational [Page 1]
+
+RFC 2803 Digest Values for DOM (DOMHASH) April 2000
+
+
+1. Introduction
+
+ The purpose of this document is to give a clear and unambiguous
+ definition of digest (hash) values of the XML objects [XML]. Two
+ subtrees are considered identical if their hash values are the same,
+ and different if their hash values are different.
+
+ There are at least two usage scenarios of DOMHASH. One is as a basis
+ for digital signatures for XML. Digital signature algorithms normally
+ require hashing a signed content before signing. DOMHASH provides a
+ concrete definition of the hash value calculation.
+
+ The other is to use DOMHASH when synchronizing two DOM structures
+ [DOM]. Suppose that a server program generates a DOM structure which
+ is to be rendered by clients. If the server makes frequent small
+ changes on a large DOM tree, it is desirable that only the modified
+ parts are sent over to the client. A client can initiate a request by
+ sending the root hash value of the structure in the cache memory. If
+ it matches with the root hash value of the current server structure,
+ nothing needs be sent. If not, then the server compares the client
+ hash with the older versions in the server's cache. If it finds one
+ that matches the client's version of the structure, then it locates
+ differences with the current version by recursively comparing the
+ hash values of each node. This way, the client can receive only an
+ updated portion of a large structure without requesting the whole
+ thing.
+
+ One way of defining digest values is to take a surface string as the
+ input for a digest algorithm. However, this approach has several
+ drawbacks. The same internal DOM structure may be represented in may
+ different ways as surface strings even if they strictly conform to
+ the XML specification. Treatment of white spaces, selection of
+ character encodings, entity references (i.e., use of ampersands), and
+ so on have impact on the generation of a surface string. If the
+ implementations of surface string generation are different, the hash
+ values would be different, resulting in unvalidatable digital
+ signatures and unsuccessful detection of identical DOM structures.
+ Therefore, it is desirable that digest of DOM is defined in the DOM
+ terms -- that is, as an unambiguous algorithm operating on a DOM
+ tree. This is the approach we take in this specification.
+
+ Introduction of namespace is another source of variation of surface
+ string because different namespace prefixes can be used for
+ representing the same namespace URI [URI]. In the following example,
+ the namespace prefix "edi" is bound to the URI
+ "http://ecommerce.org/schema" but this prefix can be arbitrary chosen
+ without changing the logical contents as shown in the second example.
+
+
+
+
+Maruyama, et al. Informational [Page 2]
+
+RFC 2803 Digest Values for DOM (DOMHASH) April 2000
+
+
+ <?xml version="1.0"?>
+ <root xmlns:edi='http://ecommerce.org/schema'>
+ <edi:order>
+ :
+ </edi:order>
+ </root>
+
+
+ <?xml version="1.0"?>
+ <root xmlns:ec='http://ecommerce.org/schema'>
+ <ec:order>
+ :
+ </ec:order>
+ </root>
+
+ The DOMHASH defined in this document is designed so that the choice
+ of the namespace prefix does not affect the digest value. In the
+ above example, both the "root" elements will get the same digest
+ value.
+
+2. Digest Calculation
+
+2.1. Overview
+
+ Hash values are defined on the DOM type Node. We consider the
+ following five node types that are used for representing a DOM
+ document structure:
+
+ - Text
+ - ProcessingInstruction
+ - Attr
+ - Element
+ - Document
+
+ Comment nodes and Document Type Definitions (DTDs) do not participate
+ in the digest value calculation. This is because DOM does not
+ require a conformant processor to create data structures for these.
+ DOMHASH is designed so that it can be computed with any XML processor
+ conformant to the DOM or SAX [SAX] specification.
+
+ Nodes with the node type EntityReference must be expanded prior to
+ digest calculation.
+
+ The digest values are defined recursively on each level of the DOM
+ tree so that only a relevant part needs to be recalculated when a
+ small portion of the tree is changed.
+
+
+
+
+
+Maruyama, et al. Informational [Page 3]
+
+RFC 2803 Digest Values for DOM (DOMHASH) April 2000
+
+
+ Below, we give the precise definitions of digest for these types. We
+ describe the format of the data to be supplied to a hash algorithm
+ using a figure and a simple description, followed by a Java code
+ fragment using the DOM API and the JDK 1.1 Platform Core API only.
+ Therefore, the semantics should be unambiguous.
+
+ As the rule of thumb, all strings are to be in UTF-16BE [UTF16]. If
+ there is a sequence of text nodes without any element nodes in
+ between, these text nodes are merged into one by concatenating them.
+ A zero-length text node is always ignored.
+
+ Note that validating and non-validating XML processors may generate
+ different DOM trees from the same XML document, due to attribute
+ normalization and default attributes. If DOMHASH is to be used for
+ testing logical equivalence between two XML documents (as opposed to
+ DOM trees), it may be necessary to normalize attributes and supply
+ default attributes prior to DOMHASH calculation.
+
+ Some legacy character encodings (such as ISO-2022-JP) have certain
+ ambiguity in translating into Unicode. This is again dependent on
+ XML processors. Treatment of such processor dependencies is out of
+ scope of this document.
+
+2.2. Namespace Considerations
+
+ To avoid the dependence on the namespace prefix, we use "expanded
+ names" to do digest calculation. If an element name or an attribute
+ name is qualified either by a explicit namespace prefix or by a
+ default namespace, the name's LocalPart is prepended by the URI of
+ the namespace (the namespace name as defined in the Namespace
+ specification [NAM]) and a colon before digest calculation. In the
+ following example, the default qualified name "order" is expanded
+ into "http://ecommerce.org/schema:order" while the explicit qualified
+ name "book:title" is expanded into "urn:loc.gov:books:title" before
+ digest calculation.
+
+ <?xml version="1.0"?>
+
+ <root xmlns='http://ecommerce.org/schema'
+ xmlns:book='urn:loc.gov:books'>
+ <order>
+ <book:title> ... </book:title>
+ :
+ </order>
+ </root>
+
+
+
+
+
+
+Maruyama, et al. Informational [Page 4]
+
+RFC 2803 Digest Values for DOM (DOMHASH) April 2000
+
+
+ We define an expanded name (either for element or attribute) as
+ follows:
+
+ If a name is not qualified, the expanded name is the name itself.
+
+ If a name is qualified with the prefix "xmlns", the expanded name
+ is undefined.
+
+ If a name is qualified either by default or by an explicit
+ namespace prefix, the expanded name is URI bound to the namespace
+ + ":" + LocalPart
+
+ In the following example code, we assume that the getExpandedName()
+ method (which returns the expanded name as defined above) is defined
+ in both Element and Attr interfaces of DOM.
+
+ Note that the digest values are not defined on namespace
+ declarations. In other words, the digest value is not defined for an
+ attribute when
+
+ - the attribute name is "xmlns", or
+ - the namespace prefix is "xmlns".
+
+ In the above example, the two attributes which are namespace
+ declarations do not have digest values and therefore will not
+ participate in the calculation of the digest value of the "root"
+ element.
+
+2.3. Definition with Code Fragments
+
+ The code fragments in the definitions below assume that they are in
+ implementation classes of Node. Therefore, a methods call without an
+ explicit object reference is for the Node itself. For example,
+ getData() returns the text data of the current node if it is a Text
+ node. The parameter digestAlgorithm is to be replaced by an
+ identifier of the digest algorithm, such as "MD5" [MD5] and "SHA-1"
+ [SHA].
+
+ The computation should begin with a four byte integer that represents
+ the type of the node, such as TEXT_NODE or ELEMENT_NODE.
+
+2.3.1. Text Nodes
+
+ The hash value of a Text node is computed on the four byte header
+ followed by the UTF-16BE encoded text string.
+
+ - TEXT_NODE (3) in 32 bit network-byte-ordered integer
+ - Text data in UTF-16BE stream (variable length)
+
+
+
+Maruyama, et al. Informational [Page 5]
+
+RFC 2803 Digest Values for DOM (DOMHASH) April 2000
+
+
+ public byte[] getDigest(String digestAlgorithm) {
+ MessageDigest md = MessageDigest.getInstance(digestAlgorithm);
+ md.update((byte)0);
+ md.update((byte)0);
+ md.update((byte)0);
+ md.update((byte)3);
+ md.update(getData().getBytes("UnicodeBigUnmarked"));
+ return md.digest();
+ }
+
+ Here, MessageDigest is in the package java.security.*, one of the
+ built-in packages of JDK 1.1.
+
+2.3.2. ProcessingInstruction Nodes
+
+ A ProcessingInstruction (PI) node has two components: the target and
+ the data. Accordingly, the hash is computed on the concatenation of
+ both, separated by 'x0000'. PI data is from the first non white
+ space character after the target to the character immediately
+ preceding the "?>".
+
+ - PROCESSING_INSTRUCTION_NODE (7) in 32 bit network-byte-ordered
+ integer
+ - PI target in UTF-16BE stream (variable length)
+ - 0x00 0x00
+ - PI data in UTF-16BE stream (variable length)
+
+ public byte[] getDigest(String digestAlgorithm) {
+ MessageDigest md = MessageDigest.getInstance(digestAlgorithm);
+ md.update((byte)0);
+ md.update((byte)0);
+ md.update((byte)0);
+ md.update((byte)7);
+ md.update(getName().getBytes("UnicodeBigUnmarked"));
+ md.update((byte)0);
+ md.update((byte)0);
+ md.update(getData().getBytes("UnicodeBigUnmarked"));
+ return md.digest();
+ }
+
+2.3.3. Attr Nodes
+
+ The digest value of Attr nodes are defined similarly to PI nodes,
+ except that we need a separator between the expanded attribute name
+ and the attribute value. The '0x0000' value in UTF-16BE is allowed
+ nowhere in an XML document, so it can serve as an unambiguous
+ separator. The expanded name must be used as the attribute name
+ because it may be qualified. Note that if the attribute is a
+
+
+
+Maruyama, et al. Informational [Page 6]
+
+RFC 2803 Digest Values for DOM (DOMHASH) April 2000
+
+
+ namespace declaration (either the attribute name is "xmlns" or its
+ prefix is "xmlns"), the digest value is undefined and the getDigest()
+ method should return null.
+
+ - ATTRIBUTE_NODE (2) in 32 bit network-byte-ordered integer
+ - Expanded attribute name in UTF-16BE stream (variable length)
+ - 0x00 0x00
+ - Attribute value in UTF-16BE stream (variable length)
+
+ public byte[] getDigest(String digestAlgorithm) {
+ if (getNodeName().equals("xmlns")
+ || getNodeName().startsWith("xmlns:"))
+ return null;
+ MessageDigest md = MessageDigest.getInstance(digestAlgorithm);
+ md.update((byte)0);
+ md.update((byte)0);
+ md.update((byte)0);
+ md.update((byte)2);
+ md.update(getExpandedName().getBytes("UnicodeBigUnmarked"));
+ md.update((byte)0);
+ md.update((byte)0);
+ md.update(getValue().getBytes("UnicodeBigUnmarked"));
+ return md.digest();
+ }
+
+2.3.4. Element Nodes
+
+ Element nodes are the most complex because they consist of other
+ nodes recursively. Hash values of these component nodes are used to
+ calculate the node's digest so that we can save computation when the
+ structure is partially changed.
+
+ First, all the attributes except for namespace declarations must be
+ collected. This list is sorted lexicographically by the expanded
+ attribute names (based on Unicode character code points). When no
+ surrogate characters are involved, this is the same as sorting in
+ ascending order in terms of the UTF-16BE encoded expanded attribute
+ names, using the string comparison operator String.compareTo() in
+ Java.
+
+ - ELEMENT_NODE (1) in 32 bit network-byte-ordered integer
+ - Expanded element name in UTF-16BE stream (variable length)
+ - 0x00 0x00
+ - A number of non-namespace-declaration attributes in 32 bit
+ network-byte-ordered unsigned integer
+ - Sequence of digest values of non-namespace-declaration attributes,
+ sorted lexicographically by expanded attribute names
+ - A number of child nodes (except for Comment nodes) in 32bit
+
+
+
+Maruyama, et al. Informational [Page 7]
+
+RFC 2803 Digest Values for DOM (DOMHASH) April 2000
+
+
+ network-byte-ordered unsigned integer
+ - Sequence of digest values of each child node except for Comment
+ nodes (variable length) (A sequence of child texts is merged to one
+ text. A zero-length text and Comment nodes are not counted as
+ child)
+
+ public byte[] getDigest(String digestAlgorithm) {
+ MessageDigest md = MessageDigest.getInstance(digestAlgorithm);
+ ByteArrayOutputStream baos = new ByteArrayOutputStream();
+ DataOutputStream dos = new DataOutputStream(baos);
+ dos.writeInt(ELEMENT_NODE);//This is stored in network byte order
+ dos.write(getExpandedName().getBytes("UnicodeBigUnmarked"));
+ dos.write((byte)0);
+ dos.write((byte)0);
+ // Collect all attributes except for namespace declarations
+ NamedNodeMap nnm = this.getAttributes();
+ int len = nnm.getLength()
+ // Find "xmlns" or "xmlns:foo" in nnm and omit it.
+ ...
+ dos.writeInt(len); // This is sorted in the network byte order
+ // Sort attributes lexicographically by expanded attribute
+ // names.
+ ...
+ // Assume that `Attr[] aattr' has sorted Attribute instances.
+ for (int i = 0; i < len; i ++)
+ dos.write(aattr[i].getDigest(digestAlgorithm));
+ Node n = this.getFirstChild();
+ // Assume that adjoining Texts are merged,
+ // there is no 0-length Text, and
+ // comment nodes are removed.
+ len = this.getChildNodes().getLength();
+ dos.writeInt(len); // This is stored in the network byte order
+ while (n != null) {
+ dos.write(n.getDigest(digestAlgorithm));
+ n = n.getNextSibling();
+ }
+ dos.close();
+ md.update(baos.toByteArray());
+ return md.digest();
+ }
+
+
+
+
+
+
+
+
+
+
+
+Maruyama, et al. Informational [Page 8]
+
+RFC 2803 Digest Values for DOM (DOMHASH) April 2000
+
+
+2.3.5. Document Nodes
+
+ A Document node may have PI nodes before and after the root Element
+ node. The digest value of a Document node is computed based on the
+ sequence of the digest values of the pre-root PI nodes, the root
+ Element node, and the post-root PI nodes in this order. Comment
+ nodes and DocumentType nodes, if any, are ignored.
+
+ - DOCUMENT_NODE (9) in 32 bit network-byte-ordered integer
+ - A number of child nodes (except for Comment and DocumentType nodes)
+ in 32bit network-byte-ordered unsigned integer
+ - Sequence of digest values of each child node except for Comment and
+ DocumentType nodes (variable length)
+
+ public byte[] getDigest(String digestAlgorithm) {
+ MessageDigest md = MessageDigest.getInstance(digestAlgorithm);
+ ByteArrayOutputStream baos = new ByteArrayOutputStream();
+ DataOutputStream dos = new DataOutputStream(baos);
+ dos.writeInt(DOCUMENT_NODE);//This is stored in network byte order
+
+ // Assume that Comment and DocumentType nodes are removed and this
+ // node has only an Element node and PI nodes.
+ len = this.getChildNodes().getLength();
+ dos.writeInt(len); // This is stored in the network byte order
+ Node n = this.getFirstChild();
+ while (n != null) {
+ dos.write(n.getDigest(digestAlgorithm));
+ n = n.getNextSibling();
+ }
+ dos.close();
+ md.update(baos.toByteArray());
+ return md.digest();
+ }
+
+3. Discussion
+
+ The definition described above can be efficiently implemented with
+ any XML processor that is conformant to either DOM and SAX
+ specification. Reference implementations are available on request.
+
+4. Security Considerations
+
+ DOMHASH is expected to be used as the basis for digital signatures
+ and other security and integrity uses. It's appropriateness for
+ such uses depends on the security of the hash algorithm used and
+ inclusion of the fundamental characteristics it is desired to check
+ in parts of the DOM model incorporated in the digest by DOMHASH.
+
+
+
+
+Maruyama, et al. Informational [Page 9]
+
+RFC 2803 Digest Values for DOM (DOMHASH) April 2000
+
+
+References
+
+ [DOM] "Document Object Model (DOM), Level 1 Specification", October
+ 1998, http://www.w3.org/TR/REC-DOM-Level-1/
+
+ [MD5] Rivest, R., "The MD5 Message-Digest Algorithm", RFC 1321,
+ April 1992.
+
+ [NAM] Tim Bray, Dave Hollander, Andrew Layman, "Namespaces in XML",
+ http://www.w3.org/TR/1999/REC-xml-names-19990114.
+
+ [SAX] David Megginson, "SAX 1.0: The Simple API for XML",
+ http://www.megginson.com/SAX/, May 1998.
+
+ [SHA] (US) National Institute of Standards and Technology, "Federal
+ Information Processing Standards Publication 180-1: Secure Hash
+ Standard", 17 April 1995.
+
+ [URI] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform
+ Resource Identifiers (URI): Generic Syntax", RFC 2396, August
+ 1998.
+
+ [UTF16] Hoffman, P., Yergeau, F., "UTF-16, an encoding of ISO 10646",
+ RFC 2781, February 2000.
+
+ [XML] Tim Bray, Jean Paoli, C. M. Sperber-McQueen, "Extensible
+ Markup Language (XML) 1.0", http://www.w3.org/TR/1998/REC-xml-
+ 19980210
+
+Authors' Addresses
+
+ Hiroshi Maruyama,
+ IBM Research, Tokyo Research Laboratory
+
+ EMail: maruyama@jp.ibm.com
+
+
+ Kent Tamura,
+ IBM Research, Tokyo Research Laboratory
+
+ EMail: kent@trl.ibm.co.jp
+
+
+ Naohiko Uramoto,
+ IBM Research, Tokyo Research Laboratory
+
+ EMail: uramoto@jp.ibm.com
+
+
+
+
+Maruyama, et al. Informational [Page 10]
+
+RFC 2803 Digest Values for DOM (DOMHASH) April 2000
+
+
+Full Copyright Statement
+
+ Copyright (C) The Internet Society (2000). All Rights Reserved.
+
+ This document and translations of it may be copied and furnished to
+ others, and derivative works that comment on or otherwise explain it
+ or assist in its implementation may be prepared, copied, published
+ and distributed, in whole or in part, without restriction of any
+ kind, provided that the above copyright notice and this paragraph are
+ included on all such copies and derivative works. However, this
+ document itself may not be modified in any way, such as by removing
+ the copyright notice or references to the Internet Society or other
+ Internet organizations, except as needed for the purpose of
+ developing Internet standards in which case the procedures for
+ copyrights defined in the Internet Standards process must be
+ followed, or as required to translate it into languages other than
+ English.
+
+ The limited permissions granted above are perpetual and will not be
+ revoked by the Internet Society or its successors or assigns.
+
+ This document and the information contained herein is provided on an
+ "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
+ TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
+ BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
+ HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
+ MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+
+Acknowledgment
+
+ Funding for the RFC Editor function is currently provided by the
+ Internet Society.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Maruyama, et al. Informational [Page 11]
+