summaryrefslogtreecommitdiff
path: root/doc/rfc/rfc1691.txt
diff options
context:
space:
mode:
authorThomas Voss <mail@thomasvoss.com> 2024-11-27 20:54:24 +0100
committerThomas Voss <mail@thomasvoss.com> 2024-11-27 20:54:24 +0100
commit4bfd864f10b68b71482b35c818559068ef8d5797 (patch)
treee3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc1691.txt
parentea76e11061bda059ae9f9ad130a9895cc85607db (diff)
doc: Add RFC documents
Diffstat (limited to 'doc/rfc/rfc1691.txt')
-rw-r--r--doc/rfc/rfc1691.txt563
1 files changed, 563 insertions, 0 deletions
diff --git a/doc/rfc/rfc1691.txt b/doc/rfc/rfc1691.txt
new file mode 100644
index 0000000..b4f7343
--- /dev/null
+++ b/doc/rfc/rfc1691.txt
@@ -0,0 +1,563 @@
+
+
+
+
+
+
+Network Working Group W. Turner
+Request for Comments: 1691 LTD
+Category: Informational August 1994
+
+
+ The Document Architecture for the Cornell Digital Library
+
+Status of this Memo
+
+ This memo provides information for the Internet community. This memo
+ does not specify an Internet standard of any kind. Distribution of
+ this memo is unlimited.
+
+Abstract
+
+ This memo defines an architecture for the storage and retrieval of
+ the digital representations for books, journals, photographic images,
+ etc., which are collected in a large organized digital library.
+
+ Two unique features of this architecture are the ability to generate
+ reference documents and the ability to create multiple views of a
+ document.
+
+Introduction
+
+ In 1989, Cornell University and Xerox Corporation, with support from
+ the Commission on Preservation and Access and later Sun Microsystems,
+ embarked on a collaborative project to study and to prototype the
+ application of digital technologies for the preservation of library
+ material. During this project, Xerox developed the College Library
+ Access and Storage System (CLASS), and Cornell developed software to
+ provide network access to the CLASS Digital Library.
+
+ Xerox and Cornell University Library staff worked closely together to
+ define requirements for storing both low- and high-resolution
+ versions of images, so that the low-resolution images could be used
+ for browsing over the network and the high-resolution images could be
+ used for printing. In addition, substantial work was done to define
+ documents with internal structures that could be navigated. Xerox
+ developed the software to create and store documents, while Cornell
+ developed complementary software to allow library users to browse the
+ documents and request printed copies over the network.
+
+ Cornell has defined a document architecture which builds on the
+ lessons learned in the CLASS project, and is maintaining digital
+ library materials in that form.
+
+
+
+
+
+Turner [Page 1]
+
+RFC 1691 CDL Document Architecture August 1994
+
+
+Document Architecture Overview
+
+ Just as a conventional library contains books rather than pages, so
+ the electronic library must contain documents rather than images.
+ During the scanning process, images are automatically linked into
+ documents by creating document structure files which order the image
+ files in the same way the binding of a book orders the pages. Thus,
+ the digital book as currently configured consists of two parts: a set
+ of individual pages stored as discrete bit map image files, and the
+ document structure files which "bind" the image files into a
+ document. In addition, a database entry is made for each digital
+ document which permits searching by author and title (i.e.,
+ bibliographic information). Beyond the order of the pages, the
+ arrangement of a physical book provides information to readers. The
+ title page and publication information come first; the table of
+ contents usually precedes the text; the text is divided into sections
+ or chapters; if there is an index, it follows the text. The reader
+ often refers to these components of a book when browsing the library
+ shelves, in order to determine whether to read the book.
+
+ The document structure provides direct access to the components of an
+ electronic document, storing the information that would otherwise be
+ lost when the book is disbound for scanning.
+
+Document Architecture Requirements
+
+ Listed below are the requirements that were initially set down for
+ the Cornell Digital Library Architecture.
+
+ 1. The architecture must be open (i.e., published and freely
+ available).
+
+ 2. The architecture should be as simple as possible (to facilitate
+ product development).
+
+ 3. The architecture should assume data storage in UNIX file systems.
+
+ 4. The architecture should allow for standard data usage, such as via
+ FTP and Gopher servers (i.e., pages of a document must exist in a
+ single directory, and the naming convention used must order them
+ in the standard collating sequence, such as the series "0001.TIF,
+ 0002.TIF,..., 0411.TIF" (NOTE: a series such as "1.TIF, 2.TIF,...,
+ 10.TIF" would be ordered "1.TIF, 10.TIF, 2.TIF, ..." which is not
+ acceptable).
+
+ 5. The architecture should provide for storing the same information
+ in different formats. For example, when a page of a document is
+ available at several different resolutions.
+
+
+
+Turner [Page 2]
+
+RFC 1691 CDL Document Architecture August 1994
+
+
+ 6. Low-resolution "thumbnail" images of each page must be stored to
+ facilitate browsing and sharing of data.
+
+ 7. The architecture must support distribution of files so that
+ similar files may be stored together, permitting optimization of
+ storage use and performance.
+
+ 8. The architecture must support documents that are composed of
+ references to all or part of other documents.
+
+ 9. The architecture must support document components which are
+ stored on separate servers distributed across the network.
+
+ 10. The architecture must support not only an hierarchical structure
+ for each document, but the ability to define multiple views of
+ each document.
+
+ 11. The architecture should accept, rather than dictate, directory
+ structures in which documents will be stored. This will permit
+ documents created in other ways to be added to the Digital
+ Library simply by adding database information rather than by
+ copying or moving files.
+
+Document Architecture Description
+
+ A digital library consists of a Digital Library Server, networked
+ storage, and a referencing database. A single digital library will
+ contain one or more collections. Each collection will contain one or
+ more documents.
+
+ The referencing database allows searching for documents by author,
+ title, and document ID. In the current implementation, the
+ referencing database is a relational SQL database, and each
+ collection is epresented by a table in the database. It is planned
+ to migrate to Z39.50 database searching as the preferred method, as
+ this protocol has been established as the standard for library
+ applications.
+
+ Authorization will be primarily collection-based, although the design
+ will permit authorization checking at any level down to the
+ individual file. Notification would come only when the patron
+ attempted to open the document or access the particular component.
+
+ Each document consists of three components: the logical structure;
+ the physical references; and the data files.
+
+
+
+
+
+
+Turner [Page 3]
+
+RFC 1691 CDL Document Architecture August 1994
+
+
+ The logical structure is a logical description of the document.
+ Conceptually, a document is a tree, with the leaves being the data
+ files (pages). At a minimum, all documents have a logical structure
+ which lists the pages in the document and the order in which they
+ appear. Usually, documents will have a more elaborate structure.
+ The logical structure relates the logical structure of a document to
+ the physical references which make up the document.
+
+ These physical references map the lowest levels of the document's
+ logical structure (the leaves of the tree) to the files that contain
+ the data. Where there are multiple representations of a page, such
+ as images at various resolutions, these are linked together in the
+ physical references file.
+
+ The data files contain the data making up a document. Any format can
+ be accommodated: image files, ASCII text, PostScript, etc. However,
+ one-to-one correspondence between data files for a given physical
+ reference is assumed. That is, if there are multiple file types for
+ a single page, these files should represent exactly the same
+ information.
+
+Physical References File
+
+ The Physical References file is the component of the document which
+ relates logical structures (logical components of documents) to
+ physical files. Document references, by which a document can be
+ composed of all or part of other documents possibly residing on
+ different servers, are handled in the Physical References file.
+
+ A document may contain multiple document objects, each of which
+ contains one or more data objects. When a document contains actual
+ physical data (for example, it is created by scanning or importing
+ images), a Master Document Object is created. When a document
+ incorporates components of other documents, a Reference Document
+ Object is created for each of the other documents. The Document
+ Objects are numbered with internal reference numbers, which are
+ included in the corresponding Data Object lines.
+
+ Data Object lines include the Document Object number, the file
+ reference number, and the file type. The Document Object number
+ refers to a Document Object line, from which the library name,
+ collection name, and document ID can be retrieved. The tuple
+
+ <libraryID>+<collectionID>+<documentID>+<filetype>+<file reference>
+
+ is guaranteed to locate a file. Each Data Object line refers to a
+ single file; where multiple file types of a single document page
+ exist, there will be multiple Data Object lines for that page.
+
+
+
+Turner [Page 4]
+
+RFC 1691 CDL Document Architecture August 1994
+
+
+ In the file, all Document Object lines will preceed all Data Object
+ lines for a given document. Document Object lines may be either
+ grouped together at the beginning of the file, or may immediately
+ preceed the first Data Object line for the Document Object. Document
+ Object lines will appear in order by Document Object number. Data
+ Object lines will appear in order by sequence number, NOT by Document
+ Object number.
+
+ The fields in the Physical References file are delimited by vertical
+ bars.
+
+Document Object Lines
+
+ Field Description Comments
+ ----- ---------------------- ----------------------------
+ 1 Document Object number 0 => Master Document Object
+ 1-9 => Reference Document Object
+ 2 Library name Server name
+ 3 Collection name
+ 4 Document ID 8-digit number
+ 5 Author name
+ 6 Volume
+ 7 Title
+ 8 Edition
+
+Data Object Lines
+
+ Field Description Comments
+ ----- ---------------------- ----------------------------
+ 1 Document Object number Corresponds to above
+ 2 Sequence number
+ 3 File reference Reference number used to locate
+ file in filing system
+ 4 Physical reference number Equal to Logical Structure file
+ 5 File type 1 = TIFF 600dpi
+ 2 = TIFF thumbnail
+ 3 = ASCII version of page
+ (i.e., OCR output)
+ 4 = ASCII notes
+ 5 = Other
+ 6 = TIFF 300dpi
+ 6 Note
+
+
+
+
+
+
+
+
+
+Turner [Page 5]
+
+RFC 1691 CDL Document Architecture August 1994
+
+
+Physical References File Example
+
++0|CORNELL|OLINLIB|00000001|Boole, Mary Everest||Philosophy Of Algebra||
+
+|0|1|00000002|5|1|| (File ref. #2 = Phys. ref. #5 = 600dpi TIFF image)
+|0|2|00000003|5|2|| (File ref. #3 = Phys. ref. #5 = 100dpi TIFF image)
+|0|3|00000004|6|1|| (File ref. #4 = Phys. ref. #6 = 600dpi TIFF image)
+|0|4|00000005|6|2|| (File ref. #5 = Phys. ref. #6 = 100dpi TIFF image)
+
+ Note that in the above, it is guaranteed that file references 2 and 3
+ are two different versions of the same page, as are file references 4
+ and 5.
+
+Logical Structure File
+
+ The Logical Structure file is the component of the document structure
+ which offers "views" of a document and links images together
+ logically to define documents. The file is actually an unloaded tree;
+ when a document is "opened", the file is read and the tree
+ reconstructed. By convention, all Logical Structure files contain one
+ logical structure "PAGES" which defines the document by listing the
+ pages in the order in which they appeared in the original document.
+
+Document Structure lines
+
+ Field Description Comments
+ ----- ---------------------- ----------------------------
+ 1 Parent structure number Structure is a child of...
+ 2 Sequence number
+ 3 Logical Structure name Label for this structure
+ 4 Structure number Equal to Physical Reference file
+ 5 Logical Children # of logical children of this
+ structure
+Document Structure lines (continued)
+
+ Field Description Comments
+ ----- ---------------------- ----------------------------
+ 6 Physical Children # of physical children of this
+ structure
+ 7 References # of references to this
+ structure within this document
+ (for how many structures is this
+ a substructure)
+
+
+
+
+
+
+
+
+Turner [Page 6]
+
+RFC 1691 CDL Document Architecture August 1994
+
+
+Logical Structure File Example
+
+|0|0|ROOT|0|4|0|0| Structure 0, ROOT, has 4 logical children
+|0|1|PAGES|1|100|0|1| Str. 1, PAGES, has 100 logical children
+|0|2|CONTENTS|2|22|0|1| Str. 2, CONTENTS, has 22 logical children
+ ...has no physical children
+ ...
+|1|1|Production note|5|0|2|2| Str. 5 is child of structure 1
+ ...has a label "Production note"
+ ...has no logical children
+ ...has 2 physical references
+ ...is referenced twice in this document
+|1|2||6|0|2|1| Str. 6 has no label
+|1|3||7|0|2|1| Str. 7 has 2 physical references
+|1|4||8|0|2|1| Str. 8 is referenced only here
+|1|5||9|0|2|1| Str. 9 is 5th sequential child of PAGES
+ ...
+|1|99||103|0|2|2|
+|1|100||104|0|2|2|
+|2|1|Production note|105|1|0|1| Str. 105 is a child of str. 2
+|2|2|Title page|106|1|0|1| Str. 106 has 1 logical child
+|2|3|Table of contents|107|2|0|1|
+|2|4|Chapter 1. From Arithmetic to Algebra|108|6|0|1|
+|2|5|Chapter 2. The Making of Algebras|109|4|0|1|
+|2|6|Chapter 3. Simultaneous Problems|110|4|0|1|
+|2|7|Chapter 4. Partial Solutions...|111|3|0|1|
+|2|8|Chapter 5. Mathematical Certainty...|112|3|0|1|
+|2|9|Chapter 6. The First Hebrew Algebra|113|8|0|1|
+|2|10|Chapter 7. How to Choose our Hypotheses|114|9|0|1|
+|2|11|Chapter 8. The Limits of the Teachers Function|115|5|0|1|
+|2|12|Chapter 9. The Use of Sewing Cards|116|4|0|1|
+ ...
+|2|20|Chapter 17. From Bondage to Freedom|124|5|0|1|
+|2|21|Appendix|125|2|1|1|
+|2|22|advertisements|126|4|1|2|
+|105|1|Production note|5|0|2|2| Str. 5 is a child of str. 105
+|106|1|Title page|11|0|2|2| 2nd reference to str. 11
+|107|1|7|15|0|2|2|
+|107|2|8|16|0|2|2|
+ ...
+|126|4||104|0|2|2|
+
+
+
+
+
+
+
+
+
+
+Turner [Page 7]
+
+RFC 1691 CDL Document Architecture August 1994
+
+
+Implementation Details
+
+ The tuple <library ID>+<collection ID>+<document ID>+<filetype>+
+ <file reference> is guaranteed to locate a file. A file locator
+ program will translate between this tuple and the fully-qualified
+ path and file name in the underlying file system. While a library
+ will always have a hierarchical nature corresponding to UNIX file
+ systems, the order of the hierarchy will be flexible to accommodate
+ optimization efforts. Each level of the hierarchy will have an INFO
+ file that describes the order of the lower levels of the hierarchy.
+ The file locator program will read these files as it navigates the
+ directory structure of the file system when a library, collection, or
+ document is opened. Two examples follow:
+
+ Example 1. Hierarchy is LIBRARY, COLLECTION, DOCUMENT, FILETYPE.
+
+ /<library name>
+ LIBINFO.TXT Description of library
+ /<collection name>
+ COLINFO.TXT Description of collection
+ /<document ID>
+ DOCINFO.TXT Description of document
+ LOGSTR.000 Logical structure file
+ PHYSREF.000 Physical reference file
+ /<filetype1>
+ 00001.TIF
+ 00002.TIF
+ ...
+ /<filetype2>
+ 00001.TIF
+ 00002.TIF
+ ...
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Turner [Page 8]
+
+RFC 1691 CDL Document Architecture August 1994
+
+
+ Example 2. Hierarchy is LIBRARY, FILETYPE, COLLECTION, DOCUMENT.
+
+ /<library name>
+
+ LIBINFO.TXT Description of library
+ /<filetype1>
+ /<collection name>
+ COLINFO.TXT Description of collection
+ /<document ID>
+ DOCINFO.TXT Description of document
+ LOGSTR.000 Logical structure file
+ PHYSREF.000 Physical reference file
+ 00001.TIF
+ 00002.TIF
+ ...
+ /<filetype2>
+ /<collection name>
+ COLINFO.TXT Description of collection
+ /<document ID>
+ DOCINFO.TXT Description of document
+ LOGSTR.000 Logical structure file
+ PHYSREF.000 Physical reference file
+ 00001.TIF
+ 00002.TIF
+ ....
+
+ This implementation involves some redundancy, but it permits complete
+ copies of a collection to be mounted on different file systems for
+ performance considerations. In particular, the second scheme would
+ facilitate storing all low-resolution images on high-speed magnetic
+ disk for fast access, and all high-resolution images on slower, less
+ expensive storage. This will also facilitate authorizing access to
+ low-resolution images by other software systems (FTP, Gopher) while
+ restricting access to high-resolution images.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Turner [Page 9]
+
+RFC 1691 CDL Document Architecture August 1994
+
+
+Security Considerations
+
+ Security issues are not discussed in this memo.
+
+References
+
+ [1] Turner, W., "Cornell Digital Library Document Architecture,
+ Version 1.1 - 3/22/94", Library Technology Department, Cornell
+ University.
+
+Author's Address
+
+ William Turner
+ Library Technology
+ 502 Olin Library
+ Cornell University
+ Ithaca, NY 14853
+
+ Phone: 607-255-9098
+ Fax: 607-255-9346
+ EMail: wrt1@cornell.edu
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Turner [Page 10]
+