
D-Lib Magazine
October 2001

Volume 7 Number 10

ISSN 1082-9873

Retrieval Issues for the Colorado Digitization Project's Heritage Database

 

William A. Garrison
University of Colorado, Boulder


Abstract

The Colorado Digitization Project (CDP), begun in the fall of 1998, is a collaborative initiative involving Colorado's archives, historical societies, libraries, and museums. The project is creating a union catalog of metadata records and has developed tools for the creators of metadata records, the assignment of subject headings, and the use of name headings. The CDP is also investigating the use of Dewey Decimal Classification numbers through WebDewey to allow linkage of general subject terms and highly specialized subject terms within a subject browse feature of the union catalog.

1. Project Overview

The Colorado Digitization Project (CDP), a collaborative of Colorado's archives, historical societies, libraries, and museums, has undertaken an initiative to increase user access to the special collections and unique resources held by these institutions. Through digitization and distribution via the Internet, the CDP is creating a virtual digital collection of resources to provide the people of Colorado access to the rich historical, scientific and cultural resources of their state. The virtual collection will include such resources as letters, diaries, government documents, manuscripts, music scores, and digital versions of exhibits, artifacts, oral histories, and maps [1]. Project participants can contribute content that has been reformatted into digital format as well as content that was born digital.

2. Development of a Union Catalog of Metadata

Since a key objective of the project is to increase access to digital collections, the first task was to identify the approaches that project participants already used to provide access to their collections. Even in the early stages of the project, it became clear that the participating institutions used different approaches and that no single standard was dominant. The CDP reviewed these approaches for common elements and examined current and emerging standards, including Encoded Archival Description (EAD), MARC, Government Information Locator Service (GILS), Dublin Core (DC), and Visual Resources Association (VRA). Because web searching would not provide the desired access, and a single centralized metadata and image system was neither politically nor financially feasible, the CDP recommended developing a union catalog of metadata to provide the desired level of access, in the hope that future developments in web searching would eventually remove the need for the union catalog. Based on an analysis of the metadata standards, the CDP adopted the Dublin Core/XML metadata standard for the union catalog. Creating a union catalog of metadata raises a wide range of issues; this paper focuses primarily on those involving retrieval.

The CDP's cultural heritage participants include many specialized institutions, for example the Florissant Fossil Beds National Monument with its large collection of unique fossils, the Crow Canyon Archaeological Center with its collection of archaeological materials, and the Boulder History Museum with its collection of more than 4,000 costumes and accessories. The first two institutions use taxonomies from their specialized fields to provide subject access to their collections, while the Boulder History Museum uses Chenhall's classification system [2]. At the same time, some of the smaller, more general collections in the CDP contain the same types of resources or subjects but use a more generalized subject heading list for subject analysis, such as the Library of Congress Subject Headings (LCSH), or they may use only uncontrolled vocabulary for subject access. The CDP union catalog will provide access to the entire range of subject terms used by participating institutions and will do so without an authority control system. As a result, unless the user knows both the general terminology and the specialized taxonomy, retrieval will be limited to records that match the term the user enters.

Name headings, both personal and corporate, also present a unique challenge. For the most part, libraries will use authorized forms of headings (i.e., headings appearing in the authority files in OCLC, RLIN, or other authority files) or headings established according to AACR2. Other types of institutions may have no resource or authority files to use and will use only the forms of headings available to them. As a result of these differences in name headings used, there will most likely be multiple forms of headings for a single person or corporate body, and there may be a single heading for multiple persons (i.e., non-unique, unestablished name headings).
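The two problems described above (several heading forms for one person, and one heading shared by several persons) can be illustrated with a small sketch. It is an illustration only: the headings, record identifiers, and the comparison rule (lowercase, drop punctuation and life dates) are hypothetical assumptions, not CDP data or CDP practice.

```python
import re
from collections import defaultdict

# Hypothetical (heading, record id) pairs; none are taken from Heritage.
HEADINGS = [
    ("Smith, John, 1850-1920", "rec1"),
    ("Smith, John", "rec2"),
    ("SMITH, JOHN.", "rec3"),
    ("Jones, Mary", "rec4"),
]


def match_key(heading):
    """Reduce a heading to a crude comparison key (an assumed rule)."""
    text = heading.lower()
    text = re.sub(r"\d{4}-(\d{4})?", "", text)  # drop life dates
    text = re.sub(r"[^a-z ]", " ", text)        # drop punctuation
    return " ".join(text.split())


def group_variants(headings):
    """Group the distinct heading forms that share a comparison key."""
    groups = defaultdict(set)
    for heading, _record_id in headings:
        groups[match_key(heading)].add(heading)
    return {key: sorted(forms) for key, forms in groups.items()}


variants = group_variants(HEADINGS)
for key, forms in variants.items():
    if len(forms) > 1:
        print(f"{key}: {len(forms)} forms -> {forms}")
```

A key of this kind only flags candidates. It cannot tell whether the undated "Smith, John" is the same person as the dated heading, which is exactly the non-unique heading problem, so a person would still need to review each group.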

3. Colorado Term and Name Lists

While no overall authority control exists as part of the CDP union catalog, two specific areas of authority control are being addressed. To ensure some level of consistency in terminology, the CDP has developed a list of Colorado terms that a user can use to perform web searches. This list includes terms for Colorado geographic names and Library of Congress (LC) subject headings with the word Colorado or the abbreviation Colo. appearing in the subject string. Users can search the list by specific term or can browse the list. The term list is being generated from subject headings in the Prospector database, which is a union catalog reflecting the collections and holdings of sixteen major public and academic research libraries in Colorado and Wyoming. The term list may be searched directly, or it may be searched while creating or modifying records on the CDP input site [3]. Institutions creating metadata records can search the term list for LC subject headings to use if another subject list or thesaurus is not being used. The CDP has begun exploring the idea of creating a thesaurus and/or a full authority file from the term list. The latter would be approached through a statewide Name Authority Cooperative (NACO) / Subject Authority Cooperative (SACO) project creating name headings and subject headings through the Program for Cooperative Cataloging. Figure 1 below shows a sample input screen with the "search the Prospector database" option on the input site.

Screen shot of the CDP Input Site Record Entry page

Figure 1. CDP Input Site

 

The CDP has also created a name list of Colorado authors (both personal and corporate) using the Prospector database. The name list has been much more difficult to create than the subject term list. As with the subject list, the user has the option of searching the Prospector catalog from the CDP data entry system to retrieve names or of using the name list. The CDP is creating the subject term list and the name list to achieve consistency in the terminology used in the union catalog, even though the CDP recognizes that some project participants will not use the lists [4]. Figure 2 shows how a search from the subject list might look, and Figure 3 illustrates a search from the name list.

Screen shot of the Sample Subject Term Search

Figure 2. Sample Subject Term Search

 

Screen shot of the Sample Name Search

Figure 3. Sample Name Search

 

The term list and the name list contain no duplicate entries; however, it should be noted that the lists contain some errors (e.g., Colorado as a whole word when the abbreviation Colo. should have been used and occasional misspellings of terms). The CDP has not yet begun to deal with the errors in these lists because as long as the database remains relatively small, putting the access points in the Heritage database under authority control is not a priority issue. Nevertheless, as the database grows the issue of authority control will loom larger.

4. The Union Catalog's Use of Dublin Core

The CDP has named its union catalog Heritage, and it uses the OCLC SiteSearch software for the public interface. Project participants may use FTP (File Transfer Protocol) to send records to the CDP for loading into the Heritage database, or they may create records in a metadata input system developed and programmed by the CDP. Records created in the input system are marked for transfer to Heritage when complete. Participants contributing records in MARC or another format provide the CDP with a profile so that the fields coming into the CDP database can be mapped to the CDP Dublin Core record format [5]. It should be noted that the CDP has departed slightly from the DC elements as currently defined by the Dublin Core Metadata Initiative (DCMI) [6]. A core set of mandatory elements has been defined, similar to the "core record" developed by the Program for Cooperative Cataloging. The mandatory elements are Title, Creator, Subject, Description, Identifier, Date, and Format.
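The profile-based mapping and the mandatory element check described above might be sketched as follows. This is a minimal illustration under stated assumptions: the profile, the field tags, the sample record, and the example.org address are hypothetical and do not reproduce the CDP's actual profiles or loader. Only the list of mandatory elements comes from the text.

```python
# Mandatory elements as listed in the article.
MANDATORY = ["Title", "Creator", "Subject", "Description",
             "Identifier", "Date", "Format"]

# Hypothetical participant profile: source field tag -> Dublin Core element.
# The tags follow common MARC usage, but this is not the CDP's real profile.
PROFILE = {
    "245": "Title",
    "100": "Creator",
    "650": "Subject",
    "520": "Description",
    "856": "Identifier",
    "260c": "Date",
}


def map_record(source_fields, profile):
    """Map (tag, value) pairs to a dict of Dublin Core element -> values."""
    record = {}
    for tag, value in source_fields:
        element = profile.get(tag)
        if element is not None:  # tags missing from the profile are skipped
            record.setdefault(element, []).append(value)
    return record


def missing_mandatory(record):
    """Return the mandatory elements that have no value in the record."""
    return [name for name in MANDATORY if not record.get(name)]


# Hypothetical incoming record.
incoming = [
    ("245", "East Portal of the Moffat Tunnel"),
    ("650", "Moffat Tunnel (Colo.)"),
    ("650", "Tunneling"),
    ("856", "http://example.org/moffat"),
    ("999", "local note"),
]
dc = map_record(incoming, PROFILE)
print(dc)
print("Missing:", missing_mandatory(dc))
```

Running the check at load time would let incomplete records be returned to the contributor before they reach the public catalog, although the article does not say whether the CDP does this.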

5. Links from the Union Catalog to Digital Images

The CDP does not store the actual digital image files, but instead provides links from the union catalog to the images residing on the project participants' servers. Participants may create metadata records for each image residing locally, or they may create one metadata record for a "collection" consisting of many images. For example, one of the participating sites stores several digital images of the Moffat Tunnel but has created only one metadata record and access point for the "collection" of tunnel images. In this example, the "collection" includes four images of the East Portal of the Moffat Tunnel and four images of the West Portal. The user can view eight images of the Moffat Tunnel but retrieves only one metadata record.
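The relationship between one collection-level record and the images it leads to can be modeled in a few lines. The sketch below is hypothetical: the class, the field names, and the example.org addresses are assumptions for illustration, since the participant's real addresses and the Heritage record structure are not given here.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    """One union catalog record; the images stay on the participant's server."""
    title: str
    identifier: str                       # link the catalog user follows
    image_urls: list = field(default_factory=list)


# Hypothetical addresses standing in for the participant's site.
BASE = "http://example.org/moffat"
collection = MetadataRecord(
    title="Moffat Tunnel",
    identifier=f"{BASE}/index.html",
    image_urls=[f"{BASE}/{portal}{n}.jpg"
                for portal in ("east", "west") for n in range(1, 5)],
)

catalog = [collection]
hits = [r for r in catalog if "moffat" in r.title.lower()]
print(len(hits), "record,", sum(len(r.image_urls) for r in hits), "images")
```

The search returns a single hit even though eight images are reachable, which mirrors the Moffat Tunnel example: the hit count in the union catalog says nothing about how many digital objects lie behind a record.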

Figures 4-6 below illustrate a search for the Moffat Tunnel records in Heritage, the resulting display record, and the appropriate website link.

Screen shot of the Moffat Tunnel Search

Figure 4. Moffat Tunnel Search

 


Figure 5. Moffat Display Record

 


Figure 6. Moffat Tunnel Website Link

 

The Moffat Tunnel display record in Figure 5 provides an excellent example of the authority control problems faced by the CDP. The first subject entry reads "Moffat Tunnel" when the correct subject heading from LCSH is "Moffat Tunnel (Colo.)." The second subject entry displayed is "Denver and Rio Grande Railroad, Moffat Road," which has no equivalent in LCSH other than the corporate entry for "Denver and Rio Grande Railroad Company," and even that probably should be under the later name of "Denver and Rio Grande Western Railroad Company" for the years 1923-1927. The third subject entry is "Railroad Construction and Tunnel Construction," which in LCSH would be split into two subject headings: "Railroads--Design and construction" and "Tunneling." Bringing all these headings under authority control would be desirable but would require a considerable amount of human intervention, something the CDP is not currently staffed to perform. The above example, of course, assumes that the controlled vocabulary of choice is LCSH.

6. Subject Retrieval and Mapping to Dewey Decimal Classification

The CDP proposes to map the subject terms in the Heritage database to Dewey Decimal Classification numbers. Currently, project participants who wish to include Dewey numbers in the metadata must either consult a printed edition of Dewey and key the number manually, or search WebDewey for a classification number and then type or copy and paste it into the metadata record. A model such as the one developed for use in the Cooperative Online Resource Catalog (CORC) seems desirable for the CDP to follow [7]. Automatic machine classification needs to be further developed before it can be implemented in a project like the CDP. The classification process as described by Hickey and Vizine-Goetz, where the Dewey schedules may be displayed with scope notes, related topics, and references to the DDC manual, provides a tremendous tool for the classifier. The ability to index and search DDC numbers in a catalog like Heritage may also prove useful [8]. However, the use of the actual classification schedules with notes and references to the manual would most likely not help the public patron searching the CDP database. For public patrons, a display of terms with context would be more useful.

The example below models how the retrieval might look in the CDP catalog for a user entering the subject search term "gold".

Search term: Gold

Results and context             No. of hits
Gold mines and mining           3
Gold prospecting/prospectors    4
Gold tableware                  1
Gold coins                      2

When the term "gold" is entered, the system retrieves the following Dewey Decimal Classification numbers: 622.3422 (Gold mines and mining), 622.1841 (Gold prospecting), 739.2283 (Gold tableware), and 737.43 (Gold coins). The classification numbers would have matched or been mapped to terms in the metadata records. The end user does not see the DDC numbers retrieved but instead views the terms/context needed to select records that match his or her needs. Using this model, the user retrieves records for digital objects by subject regardless of subject thesaurus or taxonomy used.
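The retrieval model for "gold" can be sketched as a lookup that passes through DDC numbers but shows the user only display terms and hit counts. The DDC numbers, display terms, and hit counts below are those of the example; the word index, the record identifiers, and the function name are hypothetical, and the sketch assumes that each record's own subject term has already been mapped to a DDC number.

```python
from collections import Counter

# Display terms keyed by DDC number, as given in the article's example.
DISPLAY_TERMS = {
    "622.3422": "Gold mines and mining",
    "622.1841": "Gold prospecting/prospectors",
    "739.2283": "Gold tableware",
    "737.43": "Gold coins",
}

# Hypothetical index from a search word to the DDC numbers it retrieves.
WORD_INDEX = {"gold": ["622.3422", "622.1841", "739.2283", "737.43"]}

# Hypothetical records: (record id, DDC number mapped from the record's term).
RECORDS = (
    [(f"mine-{i}", "622.3422") for i in range(3)]
    + [(f"prospect-{i}", "622.1841") for i in range(4)]
    + [("tableware-0", "739.2283")]
    + [(f"coin-{i}", "737.43") for i in range(2)]
    + [("other-0", "999.9")]  # placeholder class for an unrelated record
)


def search(word):
    """Return (display term, hit count) rows; DDC numbers stay hidden."""
    numbers = WORD_INDEX.get(word.lower(), [])
    counts = Counter(ddc for _rec, ddc in RECORDS if ddc in numbers)
    return [(DISPLAY_TERMS[n], counts[n]) for n in numbers if counts[n]]


for term, hits in search("Gold"):
    print(f"{term:<32}{hits}")
```

Because the match is made on the class number rather than on the wording of the subject term, it does not matter which thesaurus or taxonomy a contributing institution used, provided the mapping to DDC has been done.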

The CDP also needs to address the issue of which display terms should appear to the user. The terms appearing in the example above do not necessarily match the terms as used in the Dewey schedules. "Gold mines and mining" appears as 622.3422 in the schedules with the caption text "*Gold". Simply displaying "*Gold" to the end user is not going to be meaningful, as the context is not clear without the hierarchy. Therefore, a method needs to be developed to get the display term to include the hierarchical context as it would appear in the Dewey schedule. Figure 7 shows the WebDewey display for "Technology" in which the term "*Gold" appears.
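One possible way to build a display term with hierarchical context is to walk up the notation and join the captions found along the way. The sketch below is an assumption-laden illustration, not a description of WebDewey: only the caption "*Gold" at 622.3422 comes from the text, the ancestor captions are illustrative stand-ins for the schedule wording, and the function names and the " < " separator are invented for the example.

```python
# Captions keyed by DDC number. Only "*Gold" at 622.3422 is quoted in the
# article; the ancestor captions are illustrative assumptions.
CAPTIONS = {
    "622": "Mining and related operations",
    "622.3": "Mining for specific materials",
    "622.34": "Metals",
    "622.342": "Precious metals",
    "622.3422": "*Gold",
}


def ancestors(number):
    """Yield successively shorter notations above `number`, nearest first."""
    digits = number.replace(".", "")
    for length in range(len(digits) - 1, 2, -1):
        head = digits[:length]
        yield head[:3] + ("." + head[3:] if len(head) > 3 else "")


def display_term(number, levels=2):
    """Join the caption with up to `levels` ancestor captions for context."""
    parts = [CAPTIONS[number].lstrip("*")]  # drop the leading asterisk
    for parent in ancestors(number):
        if parent in CAPTIONS and len(parts) <= levels:
            parts.append(CAPTIONS[parent])
    return " < ".join(parts)


print(display_term("622.3422"))
```

A string assembled this way still differs from a reader-friendly label such as "Gold mines and mining", so some editorial choice of display terms would probably remain necessary.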

WebDewey Display of Search for Term 'Technology'

Figure 7. WebDewey Display of Search for Term "Technology"

 

There is a particular problem with mapping subject terms in the Heritage database to DDC numbers for an institution such as the Florissant Fossil Beds National Monument that uses a specific taxonomy to classify digital images of fossils. The DDC numbers for fossil invertebrates in class number 569 displayed in Figure 8 below provide an example of the detailed breakdown with taxonomic names in the schedules.

Dewey Decimal Classification Numbers for Fossil Invertebrates

Figure 8. Dewey Decimal Classification Numbers for Fossil Invertebrates

 

It is highly doubtful that anyone without a detailed knowledge of the above taxonomic terms would understand the retrieval result if these were the terms displayed. If an end user were looking for digital images of dog fossils, the terms from the DDC schedule above would not help. In the search for fossils of dogs, the end user would also not be helped by the DDC Relative Index, as there is no entry under dogs with respect to fossils. From this example, it becomes clear that there will need to be additional terms mapped by means such as the LCSH mapping that has been done in WebDewey.

Considerable work will need to be done to map the subject terms in the CDP database to the DDC numbers and a great deal of effort put into term retrieval and display. The end user must be able to retrieve records using both specific terminology and taxonomy as well as general terminology. The taxonomist needs to be able to retrieve digital images when searching for "canidae" while at the same time, the general user needs to retrieve digital images when searching for "dogs fossils".
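The requirement that "canidae" and "dogs fossils" retrieve the same images amounts to an entry vocabulary in which several terms lead to one class. The following sketch assumes such a vocabulary exists. The class key is a placeholder, because the DDC number for fossil Canidae is not given here, and the entry terms, record identifiers, and the crude singularizing rule are all hypothetical.

```python
# Placeholder class key: the article does not give the DDC number for
# fossil Canidae, so none is invented here.
FOSSIL_CANIDAE = "569:canidae"

# Hypothetical entry vocabulary: normalized search phrase -> class key.
ENTRY_TERMS = {
    "canidae": FOSSIL_CANIDAE,     # taxonomic name
    "dog fossil": FOSSIL_CANIDAE,  # general-language term
    "fossil dog": FOSSIL_CANIDAE,
}

# Hypothetical records, each already mapped to a class key.
RECORDS = {
    "img-001": FOSSIL_CANIDAE,
    "img-002": FOSSIL_CANIDAE,
    "img-003": "569:other",
}


def normalize(query):
    """Lowercase the query and crudely singularize each word (assumed rule)."""
    words = [w[:-1] if w.endswith("s") and len(w) > 3 else w
             for w in query.lower().split()]
    return " ".join(words)


def retrieve(query):
    """Return the ids of records whose class matches the query's entry term."""
    key = ENTRY_TERMS.get(normalize(query))
    if key is None:
        return []
    return sorted(rid for rid, cls in RECORDS.items() if cls == key)


print(retrieve("canidae"))
print(retrieve("dogs fossils"))
```

Both queries return the same two records. The effort lies in populating the entry vocabulary, for instance from mappings such as the LCSH mapping in WebDewey, rather than in the lookup itself.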

In determining the level of specificity required to use DDC numbers for retrieval, the DDC Table 2 geographic area notation -788 for Colorado may prove meaningless for some subject areas in a database consisting of digital resources located only in Colorado. All the digital images for gold mines will be for gold mines located in Colorado, so in a DDC number such as 622.342209788, added or mapped into the metadata record, the Table 2 notation adds nothing to retrieval. This would not be the case if the database were expanded to include records from other states, or if there were so many digital images of gold mines in Colorado that a county or local breakdown became necessary.
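The number 622.342209788 can be read as the base number 622.3422, followed by standard subdivision 09 for geographic treatment, followed by the Table 2 notation 788 for Colorado. A retrieval system that wants to ignore the area notation could separate the parts roughly as in the sketch below. The function name is hypothetical, and the plain substring match is an assumption that could misfire on other numbers, so this is an illustration rather than a dependable parser.

```python
COLORADO = "788"  # DDC Table 2 notation for Colorado, per the article


def split_area(number):
    """Split a DDC number into (base number, Colorado area notation or None).

    Assumes the area notation was added through standard subdivision 09,
    as in 622.3422 + 09 + 788. A plain substring match like this could
    misfire on other numbers, so treat it as a sketch.
    """
    marker = "09" + COLORADO
    whole, _, decimals = number.partition(".")
    position = decimals.find(marker)
    if position == -1:
        return number, None
    base = whole + "." + decimals[:position] if position else whole
    return base, decimals[position + 2:]


print(split_area("622.342209788"))    # statewide
print(split_area("622.34220978863"))  # Boulder County
print(split_area("622.3422"))         # no area notation
```

Indexing the base number and the area notation separately would let the catalog ignore the area while all records are from Colorado and start using it once records from other states, or finer local breakdowns, are added.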

On the other hand, the breakdown in DDC Table 2 for Colorado locations may require expansion for the Heritage database, especially for materials dealing with local history. In most cases, classifying materials to the county level will not be sufficient. Figure 9 from DDC Table 2 shows the geographic locations for the Northern counties of the Rocky Mountains of Colorado.

DDC Table 2 Geographic Locations for Counties in Northern Colorado

Figure 9. Dewey Decimal Classification Table 2 Geographic Locations for Northern Colorado Counties

None of the counties listed in Figure 9 above is subdivided into more local geographic areas. For example, the Table 2 notation for Boulder County, -78863, may not be sufficient when distinctions need to be made (especially for historical documents or photographs) between cities located within Boulder County, e.g., Boulder (city), Erie, Superior, Louisville, and Longmont. The CDP will need to determine exactly how this expansion will take place. One solution might be to develop a link to a system or site with GIS data to achieve this expansion.
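One simple form such an expansion could take is a local table of place names attached to a county notation, with each record carrying the county notation plus an optional place. The sketch below assumes that design. The notation -78863 and the place names come from the text, while the table layout, the record identifiers, and the function name are hypothetical; a GIS-backed gazetteer could supply the table instead.

```python
BOULDER_COUNTY = "78863"  # Table 2 notation cited in the article

# Hypothetical local extension: place names recorded alongside the county
# notation. A GIS-backed gazetteer could supply this table instead.
LOCAL_PLACES = {
    BOULDER_COUNTY: ["Boulder (city)", "Erie", "Superior",
                     "Louisville", "Longmont"],
}

# Hypothetical records: (record id, area notation, local place or None).
RECORDS = [
    ("photo-1", BOULDER_COUNTY, "Longmont"),
    ("photo-2", BOULDER_COUNTY, "Louisville"),
    ("photo-3", BOULDER_COUNTY, None),  # classed only to the county level
]


def find(area, place=None):
    """Return record ids for an area, optionally narrowed to a local place."""
    if place is not None and place not in LOCAL_PLACES.get(area, []):
        raise ValueError(f"{place!r} is not listed under -{area}")
    return [rid for rid, rec_area, rec_place in RECORDS
            if rec_area == area and (place is None or rec_place == place)]


print(find(BOULDER_COUNTY))
print(find(BOULDER_COUNTY, "Longmont"))
```

Keeping the place as a separate value leaves the DDC notation itself unchanged, so records classed only to the county level still appear in county-wide searches.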

7. Conclusions

For the foreseeable future, retrieval problems will continue in Heritage. No easy or simple solutions to the authority control problems exist. Dewey Decimal Classification seems to offer the most promising method for subject retrieval in databases like Heritage that contain subject vocabulary from multiple thesauri. Further testing will be necessary to determine the feasibility of using Dewey Decimal Classification for subject retrieval as described in this article. Retrieval and authority control issues for names (both personal and corporate) will be somewhat more difficult to resolve without human intervention and control. The possible statewide NACO/SACO project to create name headings and subject headings through the Program for Cooperative Cataloging might provide a solution to the problem of access and retrieval in Heritage. Heritage has just been opened to public view [9]. The author looks forward to the research continuing as the Heritage database is used by participating institutions and public patrons [10].

Notes and References

[1] Liz Bishoff and William A. Garrison. "Metadata, Cataloging, Digitization, and Retrieval: Who's Doing What To Whom: The Colorado Digitization Project Experience," Proceedings of the Bicentennial Conference on Bibliographic Control for the New Millennium (Washington, D.C.: Library of Congress, Cataloging Distribution Service, 2001), 377.

[2] James R. Blackaby and Patricia Greeno. The Revised Nomenclature For Museum Cataloging: A Revised and Expanded Version of Robert G. Chenhall's System for Classifying Man-Made Objects (Nashville, TN: American Association for State and Local History, 1988).

[3] For more information on the Prospector database see: Carmel Bush, William A. Garrison, George Machovec, and Helen I. Reed, "Prospector: A Multivendor, Multitype, and Multistate Western Union Catalog," Information Technology and Libraries, 19, no. 2 (June 2000): 71-83.

[4] Both the subject term and name term lists may be found at: <http://babu.coalliance.org:10101>.

[5] The CDP Dublin Core element set may be found at: <http://coloradodigital.coalliance.org/glines.html>.

[6] The Dublin Core element set may be found at: <http://dublincore.org/documents/dces>.

[7] Thomas B. Hickey and Diane Vizine-Goetz, "The Role of Classification in CORC," Annual Review of OCLC Research, 1999 [on-line]; available from <http://www.oclc.org/research/publications/arr/1999/hickey/corc.htm>; Internet; accessed 28 September 2001.

[8] Karen Markey Drabenstott, "Dewey Decimal Online Classification Project: Integration of a Library Schedule and Index into the Subject Searching Capabilities of an Online Catalogue," International Cataloguing, 14 (July 1985): 31-4.

[9] Heritage may be accessed from the CDP website by clicking on "Heritage" at: <http://coloradodigital.coalliance.org> or from the Access Colorado Library and Information Network (ACLIN) Colorado Virtual Library website under the "Create your own group" section at <http://www.aclin.org>.

[10] Portions of this paper were presented at and appear in "Subject Retrieval in a Networked Environment: Papers presented at an IFLA Satellite Meeting sponsored by the IFLA Section on Classification and Indexing & IFLA Section on Information Technology, OCLC, Dublin, Ohio, USA, 14-16 August 2001."

Copyright 2001 William A. Garrison

DOI: 10.1045/october2001-garrison