GENWiki

Premier IT Outsourcing and Support Services within the UK

User Tools

Site Tools


rfc:rfc1691

Network Working Group W. Turner Request for Comments: 1691 LTD Category: Informational August 1994

     The Document Architecture for the Cornell Digital Library

Status of this Memo

 This memo provides information for the Internet community.  This memo
 does not specify an Internet standard of any kind.  Distribution of
 this memo is unlimited.

Abstract

 This memo defines an architecture for the storage and retrieval of
 the digital representations for books, journals, photographic images,
 etc., which are collected in a large organized digital library.
 Two unique features of this architecture are the ability to generate
 reference documents and the ability to create multiple views of a
 document.

Introduction

 In 1989, Cornell University and Xerox Corporation, with support from
 the Commission on Preservation and Access and later Sun Microsystems,
 embarked on a collaborative project to study and to prototype the
 application of digital technologies for the preservation of library
 material.  During this project, Xerox developed the College Library
 Access and Storage System (CLASS), and Cornell developed software to
 provide network access to the CLASS Digital Library.
 Xerox and Cornell University Library staff worked closely together to
 define requirements for storing both low- and high-resolution
 versions of images, so that the low-resolution images could be used
 for browsing over the network and the high-resolution images could be
 used for printing.  In addition, substantial work was done to define
 documents with internal structures that could be navigated.  Xerox
 developed the software to create and store documents, while Cornell
 developed complementary software to allow library users to browse the
 documents and request printed copies over the network.
 Cornell has defined a document architecture which builds on the
 lessons learned in the CLASS project, and is maintaining digital
 library materials in that form.

Turner [Page 1] RFC 1691 CDL Document Architecture August 1994

Document Architecture Overview

 Just as a conventional library contains books rather than pages, so
 the electronic library must contain documents rather than images.
 During the scanning process, images are automatically linked into
 documents by creating document structure files which order the image
 files in the same way the binding of a book orders the pages.  Thus,
 the digital book as currently configured consists of two parts: a set
 of individual pages stored as discrete bit map image files, and the
 document structure files which "bind" the image files into a
 document.  In addition, a database entry is made for each digital
 document which permits searching by author and title (i.e.,
 bibliographic information).  Beyond the order of the pages, the
 arrangement of a physical book provides information to readers.  The
 title page and publication information come first; the table of
 contents usually precedes the text; the text is divided into sections
 or chapters; if there is an index, it follows the text.  The reader
 often refers to these components of a book when browsing the library
 shelves, in order to determine whether to read the book.
 The document structure provides direct access to the components of an
 electronic document, storing the information that would otherwise be
 lost when the book is disbound for scanning.

Document Architecture Requirements

 Listed below are the requirements that were initially set down for
 the Cornell Digital Library Architecture.
 1. The architecture must be open (i.e., published and freely
    available).
 2. The architecture should be as simple as possible (to facilitate
    product development).
 3. The architecture should assume data storage in UNIX file systems.
 4. The architecture should allow for standard data usage, such as via
    FTP and Gopher servers (i.e., pages of a document must exist in a
    single directory, and the naming convention used must order them
    in the standard collating sequence, such as the series "0001.TIF,
    0002.TIF,..., 0411.TIF" (NOTE: a series such as "1.TIF, 2.TIF,...,
    10.TIF" would be ordered "1.TIF, 10.TIF, 2.TIF, ..." which is not
    acceptable).
 5. The architecture should provide for storing the same information
    in different formats.  For example, when a page of a document is
    available at several different resolutions.

Turner [Page 2] RFC 1691 CDL Document Architecture August 1994

 6. Low-resolution "thumbnail" images of each page must be stored to
    facilitate browsing and sharing of data.
 7. The architecture must support distribution of files so that
    similar files may be stored together, permitting optimization of
    storage use and performance.
 8. The architecture must support documents that are composed of
    references to all or part of other documents.
 9. The architecture must support document components which are
    stored on separate servers distributed across the network.
 10. The architecture must support not only an hierarchical structure
     for each document, but the ability to define multiple views of
     each document.
 11. The architecture should accept, rather than dictate, directory
     structures in which documents will be stored.  This will permit
     documents created in other ways to be added to the Digital
     Library simply by adding database information rather than by
     copying or moving files.

Document Architecture Description

 A digital library consists of a Digital Library Server, networked
 storage, and a referencing database.  A single digital library will
 contain one or more collections.  Each collection will contain one or
 more documents.
 The referencing database allows searching for documents by author,
 title, and document ID.  In the current implementation, the
 referencing database is a relational SQL database, and each
 collection is  epresented by a table in the database.  It is planned
 to migrate to Z39.50 database searching as the preferred method, as
 this protocol has been established as the standard for library
 applications.
 Authorization will be primarily collection-based, although the design
 will permit authorization checking at any level down to the
 individual file.  Notification would come only when the patron
 attempted to open the document or access the particular component.
 Each document consists of three components: the logical structure;
 the physical references; and the data files.

Turner [Page 3] RFC 1691 CDL Document Architecture August 1994

 The logical structure is a logical description of the document.
 Conceptually, a document is a tree, with the leaves being the data
 files (pages).  At a minimum, all documents have a logical structure
 which lists the pages in the document and the order in which they
 appear.  Usually, documents will have a more elaborate structure.
 The logical structure relates the logical structure of a document to
 the physical references which make up the document.
 These physical references map the lowest levels of the document's
 logical structure (the leaves of the tree) to the files that contain
 the data.  Where there are multiple representations of a page, such
 as images at various resolutions, these are linked together in the
 physical references file.
 The data files contain the data making up a document.  Any format can
 be accommodated: image files, ASCII text, PostScript, etc.  However,
 one-to-one correspondence between data files for a given physical
 reference is assumed.  That is, if there are multiple file types for
 a single page, these files should represent exactly the same
 information.

Physical References File

 The Physical References file is the component of the document which
 relates logical structures (logical components of documents) to
 physical files.  Document references, by which a document can be
 composed of all or part of other documents possibly residing on
 different servers, are handled in the Physical References file.
 A document may contain multiple document objects, each of which
 contains one or more data objects.  When a document contains actual
 physical data (for example, it is created by scanning or importing
 images), a Master Document Object is created.  When a document
 incorporates components of other documents, a Reference Document
 Object is created for each of the other documents.  The Document
 Objects are numbered with internal reference numbers, which are
 included in the corresponding Data Object lines.
 Data Object lines include the Document Object number, the file
 reference number, and the file type.  The Document Object number
 refers to a Document Object line, from which the library name,
 collection name, and document ID can be retrieved.  The tuple
 <libraryID>+<collectionID>+<documentID>+<filetype>+<file reference>
 is guaranteed to locate a file.  Each Data Object line refers to a
 single file; where multiple file types of a single document page
 exist, there will be multiple Data Object lines for that page.

Turner [Page 4] RFC 1691 CDL Document Architecture August 1994

 In the file, all Document Object lines will preceed all Data Object
 lines for a given document.  Document Object lines may be either
 grouped together at the beginning of the file, or may immediately
 preceed the first Data Object line for the Document Object. Document
 Object lines will appear in order by Document Object number.  Data
 Object lines will appear in order by sequence number, NOT by Document
 Object number.
 The fields in the Physical References file are delimited by vertical
 bars.

Document Object Lines

 Field   Description                  Comments
 -----   ----------------------       ----------------------------
   1     Document Object number       0 => Master Document Object
                                      1-9 => Reference Document Object
   2     Library name                 Server name
   3     Collection name
   4     Document ID                  8-digit number
   5     Author name
   6     Volume
   7     Title
   8     Edition

Data Object Lines

 Field   Description                  Comments
 -----   ----------------------       ----------------------------
   1     Document Object number       Corresponds to above
   2     Sequence number
   3     File reference               Reference number used to locate
                                      file in filing system
   4     Physical reference number    Equal to Logical Structure file
   5     File type                    1 = TIFF 600dpi
                                      2 = TIFF thumbnail
                                      3 = ASCII version of page
                                          (i.e., OCR output)
                                      4 = ASCII notes
                                      5 = Other
                                      6 = TIFF 300dpi
   6     Note

Turner [Page 5] RFC 1691 CDL Document Architecture August 1994

Physical References File Example

+0|CORNELL|OLINLIB|00000001|Boole, Mary Everest||Philosophy Of Algebra||

010000000251
020000000352
030000000461
040000000562
 Note that in the above, it is guaranteed that file references 2 and 3
 are two different versions of the same page, as are file references 4
 and 5.

Logical Structure File

 The Logical Structure file is the component of the document structure
 which offers "views" of a document and links images together
 logically to define documents. The file is actually an unloaded tree;
 when a document is "opened", the file is read and the tree
 reconstructed. By convention, all Logical Structure files contain one
 logical structure "PAGES" which defines the document by listing the
 pages in the order in which they appeared in the original document.

Document Structure lines

 Field   Description                  Comments
 -----   ----------------------       ----------------------------
   1     Parent structure number      Structure is a child of...
   2     Sequence number
   3     Logical Structure name       Label for this structure
   4     Structure number             Equal to Physical Reference file
   5     Logical Children             # of logical children of this
                                        structure

Document Structure lines (continued)

 Field   Description                  Comments
 -----   ----------------------       ----------------------------
   6     Physical Children            # of physical children of this
                                        structure
   7     References                   # of references to this
                                        structure within this document
                                      (for how many structures is this
                                       a substructure)

Turner [Page 6] RFC 1691 CDL Document Architecture August 1994

Logical Structure File Example

00ROOT0400
01PAGES110001
02CONTENTS22201
11Production note5022 Str. 5 is child of structure 1
                            ...has a label "Production note"
                            ...has no logical children
                            ...has 2 physical references
                            ...is referenced twice in this document
126021
137021
148021
159021

199103022
1100104022
21Production note105101
22Title page106101
23Table of contents107201
24Chapter 1. From Arithmetic to Algebra108601
25Chapter 2. The Making of Algebras109401
26Chapter 3. Simultaneous Problems110401
27Chapter 4. Partial Solutions…111301
28Chapter 5. Mathematical Certainty…112301
29Chapter 6. The First Hebrew Algebra113801
210Chapter 7. How to Choose our Hypotheses114901
211Chapter 8. The Limits of the Teachers Function115501
212Chapter 9. The Use of Sewing Cards116401

220Chapter 17. From Bondage to Freedom124501
221Appendix125211
222advertisements126412
1051Production note5022
1061Title page11022
1071715022
1072816022

1264104022

Turner [Page 7] RFC 1691 CDL Document Architecture August 1994

Implementation Details

 The tuple <library ID>+<collection ID>+<document ID>+<filetype>+
 <file reference> is guaranteed to locate a file.  A file locator
 program will translate between this tuple and the fully-qualified
 path and file name in the underlying file system.  While a library
 will always have a hierarchical nature corresponding to UNIX file
 systems, the order of the hierarchy will be flexible to accommodate
 optimization efforts.  Each level of the hierarchy will have an INFO
 file that describes the order of the lower levels of the hierarchy.
 The file locator program will read these files as it navigates the
 directory structure of the file system when a library, collection, or
 document is opened.  Two examples follow:
   Example 1.  Hierarchy is LIBRARY, COLLECTION, DOCUMENT, FILETYPE.
/<library name>
        LIBINFO.TXT                      Description of library
        /<collection name>
               COLINFO.TXT               Description of collection
               /<document ID>
                     DOCINFO.TXT         Description of document
                     LOGSTR.000          Logical structure file
                     PHYSREF.000         Physical reference file
                     /<filetype1>
                             00001.TIF
                             00002.TIF
                             ...
                     /<filetype2>
                             00001.TIF
                             00002.TIF
                             ...

Turner [Page 8] RFC 1691 CDL Document Architecture August 1994

 Example 2.  Hierarchy is LIBRARY, FILETYPE, COLLECTION, DOCUMENT.
/<library name>
        LIBINFO.TXT                         Description of library
        /<filetype1>
                /<collection name>
                       COLINFO.TXT          Description of collection
                       /<document ID>
                             DOCINFO.TXT    Description of document
                             LOGSTR.000     Logical structure file
                             PHYSREF.000    Physical reference file
                             00001.TIF
                             00002.TIF
                             ...
        /<filetype2>
                /<collection name>
                       COLINFO.TXT          Description of collection
                       /<document ID>
                             DOCINFO.TXT    Description of document
                             LOGSTR.000     Logical structure file
                             PHYSREF.000    Physical reference file
                             00001.TIF
                             00002.TIF
                             ....
 This implementation involves some redundancy, but it permits complete
 copies of a collection to be mounted on different file systems for
 performance considerations.  In particular, the second scheme would
 facilitate storing all low-resolution images on high-speed magnetic
 disk for fast access, and all high-resolution images on slower, less
 expensive storage.  This will also facilitate authorizing access to
 low-resolution images by other software systems (FTP, Gopher) while
 restricting access to high-resolution images.

Turner [Page 9] RFC 1691 CDL Document Architecture August 1994

Security Considerations

 Security issues are not discussed in this memo.

References

 [1] Turner, W., "Cornell Digital Library Document Architecture,
     Version 1.1 - 3/22/94", Library Technology Department, Cornell
     University.

Author's Address

     William Turner
     Library Technology
     502 Olin Library
     Cornell University
     Ithaca, NY  14853
     Phone: 607-255-9098
     Fax:   607-255-9346
     EMail: wrt1@cornell.edu

Turner [Page 10]

/data/webs/external/dokuwiki/data/pages/rfc/rfc1691.txt · Last modified: 1994/08/16 20:38 by 127.0.0.1

Donate Powered by PHP Valid HTML5 Valid CSS Driven by DokuWiki