A Metadata Search Engine for Digital Language Archives

D-Lib Magazine
February 2005

Volume 11 Number 2

ISSN 1082-9873

Baden Hughes and Amol Kamat
Department of Computer Science and Software Engineering
University of Melbourne
Parkville VIC 3010, Australia

Abstract

In this article we describe the design and implementation of a full-featured metadata search engine within the Open Language Archives Community (OLAC). Unlike many digital library search engines, this particular implementation has a high degree of affinity with web search engines in terms of reasoning and results display, and presumes no knowledge of the underlying metadata or database structures on the part of the user. Features of the search engine include a variety of string matching algorithms; a thesaurus of alternate language names; language code searching; keyword-in-context display in search results; search for similarly spelled words; search for similar items; support for standard string search operators and domain-specific inline syntax; and automatically derived search links for other web search engines. A notable contribution of this research is the inclusion in the search engine results of a metadata quality-centric sorting algorithm.

Introduction

While much effort is invested in the creation of management and standards infrastructure for digital libraries, it is in reality the end user interface that determines the degree to which the rich metadata can be exploited for information retrieval tasks. Access to the holdings of digital libraries through simple yet powerful search interfaces is an often-neglected area for the digital archives community. In short, we can observe a juxtaposition of rich metadata repositories with poor search facilities, which leads to a lower return on investment (and correspondingly lower incentives) for the creation of rich metadata.

In this article, we report the design and implementation of a metadata search engine within a specialised OAI [1] sub-domain, the Open Language Archives Community (OLAC) [2]. The purpose of this engine is to provide a powerful search facility across the collection of OLAC archives. Modeled on features common to web search engines, the service described here allows a user to explore relevant archive holdings without requiring any knowledge of the underlying data structures. Search results are displayed based on a metadata quality ranking scheme that in turn is derived from each archive's adherence to best practice recommendations in terms of metadata standards, both external and internal to the community.

The structure of this article is as follows: first we provide some background regarding the community itself, followed by the motivation for the particular work reported here. Next we describe the design and implementation of the search engine. An evaluation of the search engine against other incumbent services is then followed by the description of a number of items for future work. Finally we reflect on the success of this service and consider the wider implications for the digital libraries community as a whole.

Background

The Open Language Archives Community (OLAC) is a consortium of linguistic data archives, consisting of 31 archives and a corresponding catalogue of more than 28,000 objects described by metadata. For a more detailed description, we refer interested readers to Bird and Simons, 2003 [16] and Simons and Bird, 2003 [15].

OLAC metadata is based on Dublin Core, with a number of extensions [14] to the Dublin Core Metadata Set [3] for relevant conceptual domains such as language [9], linguistic type [6], subject language [7], linguistic subject [8] and linguistic role [10].

Derived from the model adopted within the OAI, the OLAC model has a two-tiered approach to implementation. Data providers are the institutional language archives that publish their XML-based metadata according to the OAI Static Repository standard [4]. Individual archives use a variety of software to manage their catalogues internally.
Service providers leverage the OAI Protocol for Metadata Harvesting [5] to harvest the XML expressions of metadata catalogues. Within the OLAC community, typical practice is to aggregate these into an SQL database using the OLAC Harvester and Aggregator [11]. Service providers can then build services that utilise the union catalogue of OLAC metadata.

As a metadata community and a virtual digital library, OLAC has motivated a number of developments at the OAI level, notably the need for supporting static repositories [4], the development of virtual service providers [12], and personal metadata creation and management tools [13].

Recent work by the authors [17] has resulted in a metadata quality evaluation infrastructure being added to the suite of OLAC services. The work here extends and leverages this earlier contribution and includes the quality of metadata as a determining factor in the display of search results.

Motivation

The need to facilitate efficient information discovery in digital libraries and archives is certainly not a new desideratum. However, it is notable that with recent efforts to create standardised catalogues for digital archives and libraries, much of the focus has been on the creation of rich metadata, rather than on discovery services that effectively leverage metadata repositories to allow users to locate and interact with the data itself.

In this light, OLAC is no different. While resource discovery has been an underlying motivation for the establishment of metadata standards and extensions for language resources, tools that allow non-domain-specialist users to search for and locate resources of interest have been largely neglected. Several specialist tools have emerged ([27], [28]), but these do not provide interfaces that are intuitive to non-linguists, nor are they similar to widely used web search engines with their inclusion of value-added features.
Furthermore, metadata communities are often grounded in contexts where collection management is the primary concern. As such, metadata creators typically view rich metadata as inherently valuable in itself. However, we believe that for a greater return on investment for metadata creation and management, high quality end user services are required to expose the rich metadata collections to wider audiences.

Our motivations in this context are therefore to address some of these needs within the Open Language Archives Community. The goals of this work are four-fold. First, we seek to provide an information discovery service that counters the perception that searching the web at large is more intuitive and productive than searching within the catalogues of digital archives and libraries. Second, we are concerned with reducing the barrier to entry, eliminating the need for the user to understand the underlying metadata standards and database structures in order to find information effectively. Third, we believe the qualitative dimension of resource discovery allows opportunities to leverage the rich metadata collections we have, and to use quality-centric metrics (such as those found in Hughes [17]) as catalysts for ranking and displaying search results. Fourth, and finally, we are motivated by the need to provide compelling, value-added services as an incentive to create metadata for language resources [18].

Search Engine Description

In this section we describe the design of the search engine [20], adopting a three-way distinction between the search input features, the search engine logic, and the search results display.

Search Input

The user interface at the search entry point is deliberately simple: a single text input field and the option to select one or all archives within OLAC. There is no presupposition as to what the input keywords may be. Minimally we conceive that input may be one or more of: language name, country name, linguistic data feature, personal name, or any combination of these types. Input is treated as case insensitive, and supports standard search operators such as `AND`, `OR`, etc. An additional feature is the support for inline syntax, derived from the element names that make up the OLAC metadata set (e.g., `creator: hale`), allowing a significant degree of flexibility with regard to the granularity of a search.

Search Engine Logic

The search engine draws on three resources: exact word matching against a full-text index of metadata element content, soundex values for approximate string matching, and Ethnologue [23] data providing information on alternate language names.

The default behaviour of the search is to match exact words in the search string and to return records that contain all words, using a full-text index on the content of metadata elements. If exact matching is not successful, a range of heuristics is applied to determine the closest relevant matches.

When the default search yields no results, our aim is to minimize the number of mouse clicks a user requires to find information on either correct spellings of the search term, or information on topics related to the original search. In the event that no matching records are found in the database, the query string is first checked to determine whether it is a language name. If there are no records about an input language name, it is assumed to be more helpful to present alternate names for the language, rather than similarly spelled words. (This approach is grounded in the complexity of language name standardisation, a well-known problem in the domain.)

Where the query string is not a language name, similarly spelled words are suggested. These words are retrieved from a table of words and their soundex values, the vocabulary consisting of all distinct words found in the content of all OLAC metadata elements.
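The similar-spelling lookup can be illustrated with a minimal Python sketch. This is not the engine's own code (the operational service is written in PHP over MySQL); the function names `soundex`, `levenshtein`, `build_index` and `suggest` are hypothetical, the in-memory dictionary stands in for the indexed database table, and the variant of soundex used (standard American Soundex) is an assumption, since the article does not say which variant the engine applies. Candidates sharing the query's soundex value are ordered by Levenshtein edit distance [24] from the query, closest first.

```python
from collections import defaultdict

# Letter-to-digit mapping of standard American Soundex (assumed variant).
_CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
        "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}


def soundex(word: str) -> str:
    """Return the four-character soundex key of a word ('' if no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit and digit != prev:
            key += digit
        if c not in "hw":  # 'h' and 'w' do not separate repeated codes
            prev = digit
    return (key + "000")[:4]


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def build_index(vocabulary):
    """Map each soundex key to the set of distinct words that share it."""
    index = defaultdict(set)
    for word in vocabulary:
        index[soundex(word)].add(word.lower())
    return index


def suggest(term, index):
    """Words with the same soundex key as `term`, closest spelling first."""
    term = term.lower()
    candidates = index.get(soundex(term), set()) - {term}
    return sorted(candidates, key=lambda w: (levenshtein(term, w), w))
```

With a vocabulary of `Warlpiri`, `Walpiri`, `Warlbiri` and `dictionary`, `suggest("warlpirri", index)` returns `["warlpiri", "warlbiri"]`: `walpiri` is excluded because its soundex key differs, and the two remaining candidates are ordered by edit distance 1 and 2.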
The table is indexed on the soundex value, to provide faster lookup. The words with a soundex value equal to that of the search term are retrieved and sorted according to their Levenshtein edit distance [24] from the search term. Similarly, language names from the Ethnologue tables that have identical soundex values are ranked by Levenshtein distance and displayed as possible corrections to the original query string. Again, this separation of similarly spelled words and similarly spelled language names assumes that language names are the most frequent search targets.

The search engine also features an integrated domain thesaurus and ontology, derived from the authoritative source of language name classification, the SIL Ethnologue [23]. If the search string is recognised as a language or country name, or a variant of one (using the Ethnologue data on alternate language and dialect names), the user is presented with links to launch queries related to that language or country. Where the original search is for a language name, a link is available to view records that use an alternate name for that language or records that provide information on a dialect of the originally searched language. Similarly, if the original search string is a country name, a list of languages that are spoken in that country can be viewed. Recognising language and country names allows a link to the Ethnologue entry on that search term to be presented where relevant.

Furthermore, the use of the domain thesaurus and ontology also allows users with linguistic domain knowledge to search by language code, which is a commonly used approach in other language resource repositories.

Search Results Display

Search results are grouped by archive, with the archives sorted by the aggregate of the metadata quality scores of the retrieved records (based on the work reported in Hughes [17]).
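The grouping and ordering of results can be sketched as follows. This is an illustration only, under stated assumptions: the `Hit` record type and the `group_by_archive` function are hypothetical names, the per-record `quality` value stands in for the score produced by the metadata quality evaluation service [17], and the aggregate is taken to be a simple sum, which the article does not specify.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One retrieved record (hypothetical structure for illustration)."""
    archive: str
    identifier: str
    quality: float  # assumed per-record metadata quality score


def group_by_archive(hits):
    """Group hits by archive and order the archives by summed quality.

    Returns a list of (archive, records) pairs, highest aggregate first.
    Ties are broken by archive name; records keep their retrieval order.
    """
    groups = defaultdict(list)
    for hit in hits:
        groups[hit.archive].append(hit)
    return sorted(
        groups.items(),
        key=lambda item: (-sum(h.quality for h in item[1]), item[0]),
    )
```

Under the summing assumption, an archive with two matching records scored 7.0 each (aggregate 14.0) is listed before an archive with three records scored 4.0 each (aggregate 12.0), which in turn precedes an archive with a single record scored 9.5, so both the number of matches and their quality affect the order.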
Sorting archives by aggregate quality score provides a balance in the presentation of search results between the quantity of matching records found in an archive and the quality of those records.

Each retrieved record displays the element containing the matching word with keyword-in-context (KWIC) highlighting. Elements that have been deemed useful in indicating record content (`title`, `description`, `subject`, `date`, `identifier`) are used to provide additional summary information.

A number of additional utilities are provided to the user in the context of search results. These include links to instantiate precomposed queries for alternate names; language-code based searching; links to the Ethnologue; and precomposed combinatorial web queries for language names and linguistic domain-specific terms.

Implementation

The search engine is built on top of the foundational layer provided by the OLAC Harvester and Aggregator [11] and the metadata quality evaluation service, the OLAC Archive Reports [17]. The implementation uses the open source technologies MySQL [21] and PHP [22], and it can be installed on a range of platforms.

The operational instance of the OLAC Search Engine can be viewed online [20]. The software has been released under an open source license and is freely available to interested parties from Sourceforge [19].

Evaluation

There are a number of other search services to which our contribution can be legitimately compared, as there are several search engines operational in the OLAC context. Here we compare our work to three specific services, and reflect on the relative strengths and weaknesses of each of the incumbent services.

Search OLAC Archives [25] is an example of a very basic search engine and provides single keyword search over harvested archives from the union catalogue. Our implementation is similar in that it operates as a service over the union catalogue of OLAC metadata. However, our implementation is significantly more powerful and more fully featured in many areas, including support for up to 5 keywords.

The OLAC Web Crawler Gateway is a customised version of the OAI DP9 gateway service [26] that allows indexing of an OAI data provider by an Internet search engine. DP9 does this by providing a persistent URL for repository records and converting this to an OAI query against the appropriate repository when the URL is requested. This allows search engines that do not support the OAI protocol to index the "deep web" contained within OAI-compliant repositories. Thus the user is constrained only by their choice of web search engine, since these engines are in fact the entry point to the OLAC data. While the motivation behind the DP9 interface differs from our own, we do note that the two implementations have a feature in common: the ability to search for a metadata record using an OAI record identifier. For searching motivated by third-party referencing, this feature is invaluable. Furthermore, our engine does not preclude a broad coverage search engine being used to complement finer-grained searching.

The LinguistList offers two implementations of an OLAC search engine ([27] and [28]). The default Search mode supports up to 10 keywords, but only within the `title`, `description`, `creator` and `contributor` elements. The search engine reported here, while not able to support such a large number of keywords, allows a full text search of the union catalogue, and hence has a greater depth. However, it is also possible to restrict a search to a constrained set of elements using the inline syntax method described earlier.
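To make the inline syntax concrete, the following Python sketch shows one way a query such as `creator: hale` could be split into element-constrained and full-text terms and matched against a record. It is a sketch under assumptions rather than the engine's parser: `parse_query` and `matches` are hypothetical names, the element list contains only the fifteen Dublin Core element names (the OLAC extensions are omitted), handling of `AND` and `OR` is left out, and unconstrained terms follow the default behaviour described in this article, where every word must occur as an exact, case-insensitive match.

```python
import re

# The fifteen Dublin Core element names; OLAC extensions are omitted here.
ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def parse_query(query):
    """Split a query into (element, term) pairs.

    `element` is None for terms searched across all elements. Both
    `creator: hale` and `creator:hale` constrain `hale` to `creator`.
    """
    pairs = []
    pending = None  # element name waiting for its term
    for token in query.lower().split():
        name, sep, rest = token.partition(":")
        if sep and name in ELEMENTS:
            if rest:
                pairs.append((name, rest))
            else:
                pending = name
        else:
            pairs.append((pending, token))
            pending = None
    return pairs


def _words(values):
    """Lower-cased set of words occurring in a list of strings."""
    return {w for value in values for w in re.findall(r"\w+", value.lower())}


def matches(record, pairs):
    """True if every term occurs as a whole word in its target element(s).

    `record` maps element names to lists of string values.
    """
    for element, term in pairs:
        if element is None:
            values = [v for vals in record.values() for v in vals]
        else:
            values = record.get(element, [])
        if term not in _words(values):
            return False
    return True
```

For a record with creator `Hale, Kenneth` and title `Warlpiri dictionary project`, the query `creator: hale warlpiri` matches, whereas `title: hale` does not, because the constrained term is looked up only in the named element.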
In the LinguistList Advanced Search mode, up to 10 keywords are supported, and the user can specify closer constraints such as the archive of interest or specific elements (e.g., `title`, `creator`, `contributor`), or choose codes from particular controlled vocabularies (`subject`, `language`, `type`, `discourse type`). The latter feature is also found in our search engine, although we support the function by virtue of inline syntax and in fact extend it to include any metadata element (either Dublin Core or OLAC). Furthermore, our approach allows a user to select a finer-grained search context directly from the result display, rather than having to specify it in advance of an information retrieval task.

We can see from this comparison that the contribution described here provides all of the functionality found in other instantiations within OLAC, extends a number of existing functions, and provides a range of new features. Thus we believe our contribution is valuable in allowing the user great flexibility in approach to searching for language resources, by combining existing and discovered knowledge about language resources to increase the precision of search results.

Future

Now that we have described our full-featured search engine within the OLAC domain, we turn to a discussion of future work. We identify three natural extensions to the work reported here: support for similar metadata record searching, the development of an API, and user-centric functionality modifications.

Web search engines such as Google provide a function enabling a search for similar results to a user-selected item. We would like to extend the work here to also allow such functionality, and we can conceive of three dimensions of similarity: metadata records of a similar score; metadata records with similar core elements (`title`, `description`, `subject`, `date` and `identifier`); and metadata records from similar archives.

Within the OLAC context a precedent has already been set for machine-executable services for discovery of language resources [12]. Within the web search engine community such features are also common (e.g. the Google Web API [29]). We would like to extend the search engine described here to also facilitate machine-instantiated information retrieval in a similar fashion.

Crucial to the success of any search engine is customisation to the specific needs of the user community. Web search engine usability metrics and case studies (see, for example, Nielsen [30]) motivate the active analysis of user interaction with a search service, and subsequent modifications to reflect the way in which users actually interact with the search engine. We propose therefore to undertake an extended evaluation of the types of searches that users conduct, and the types of result manipulation they undertake, in order to customise the user interface and underlying functionality to better serve those users interested in language resources.

Conclusion

Our work here has resulted in the deployment of a full-featured search engine, perhaps more typical of a web search engine than of those traditionally found within libraries. We believe that some of the features typically found in web search engines can be offered within the digital libraries and archives community, with a corresponding improvement not only in the information retrieval terms of precision and recall, but also in terms of user experience. We offer our approach and implementation to the broader digital libraries community in the hope that the model and implementation may benefit a larger range of institutional data providers and, ultimately, end-users.

Acknowledgements

We are grateful to Steven Bird for editorial comments on an earlier version of this article.
The work reported in this article has been sponsored by National Science Foundation Grant Numbers 9910603 (ISLE: International Standards in Language Engineering) and 0094934 (Querying Linguistic Databases).

Bibliography

[1] Open Archives Initiative. <http://www.openarchives.org>.
[2] Open Language Archives Community. <http://www.language-archives.org>.
[3] Dublin Core Metadata Element Set, Version 1.1: Reference Description. <http://dublincore.org/documents/dces/>.
[4] Patrick Hochstenbach, Henry Jerez, Herbert Van de Sompel, 2003. The OAI-PMH Static Repository and Static Repository Gateway. Proceedings of the IEEE/ACM Joint Conference on Digital Libraries 2003 (JCDL'03).
[5] Carl Lagoze, Herbert Van de Sompel, Michael Nelson and Simeon Warner, 2002. The Open Archives Initiative Protocol for Metadata Harvesting. <http://www.openarchives.org/OAI/openarchivesprotocol.html>.
[6] Helen Aristar-Dry and Heidi Johnson, 2002. OLAC Linguistic Data Type Vocabulary. <http://www.language-archives.org/REC/type.html>.
[7] Gary Simons and Steven Bird, 2003. OLAC Subject Language Vocabulary. <http://www.language-archives.org/REC/language.html>.
[8] Helen Aristar-Dry and Michael Appleby, 2003. OLAC Linguistic Subject Vocabulary. <http://www.language-archives.org/REC/field.html>.
[9] Gary Simons and Steven Bird, 2003. OLAC Language Vocabulary. <http://www.language-archives.org/REC/language.html>.
[10] Heidi Johnson, 2003. OLAC Role Vocabulary. <http://www.language-archives.org/REC/role.html>.
[11] Gary Simons, 2003. A Query Facility for the Selective Harvesting of OLAC Metadata. <http://www.language-archives.org/NOTE/query.html>.
[12] Gary Simons, 2003. OLAC Virtual Service Provider. <http://www.language-archives.org/viser>.
[13] Kurt Maly, Mohammad Zubair and Xiaoming Liu, 2001. Kepler. An OAI Data/Service Provider for the Individual. D-Lib Magazine 7(4). <doi:10.1045/april2001-maly>.
[14] Gary Simons and Steven Bird, 2002. Recommended Metadata Extensions. <http://www.language-archives.org/REC/olac-extensions.html>.
[15] Gary Simons and Steven Bird, 2003. The Open Language Archives Community: An infrastructure for distributed archiving of language resources. Literary and Linguistic Computing 18, pp. 117-128.
[16] Steven Bird and Gary Simons, 2003. Extending Dublin Core Metadata to support the description and discovery of language resources. Computers and the Humanities 37, pp. 375-388.
[17] Baden Hughes, 2004. Metadata Quality Evaluation: Experience from the Open Archives Initiative. Proceedings of the 7th International Conference on Asian Digital Libraries (ICADL 2004). Lecture Notes in Computer Science 3334, pp. 320-329. Springer-Verlag.
[18] Baden Hughes, 2004. Perspectives on Metadata. Proceedings of the LREC 2004 Building the Language Resources and Evaluation Roadmap: Joint COCOSDA and ICCWLRE Meeting. European Language Resources Association: Paris.
[19] Open Language Archives Community Project at Sourceforge. <http://sourceforge.net/projects/olac>.
[20] OLAC Search Engine. <http://www.language-archives.org/tools/search/>.
[21] MySQL Database Engine. <http://www.mysql.com>.
[22] PHP Scripting Engine. <http://www.php.net>.
[23] Ethnologue: Languages of the World, 14th Edition. SIL International. <http://www.ethnologue.org>.
[24] Vladimir I. Levenshtein, 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10(8), pp. 707-710.
[25] Search Language Archives. <http://www.language-archives.org/tools/search.php4>.
[26] DP9: An OAI Gateway Service for Web Crawlers. <http://arc.cs.odu.edu:8080/dp9/about.jsp>.
[27] Open Language Archives Community: LINGUIST List Gateway. <http://linguistlist.org/olac/>.
[28] Open Language Archives Community: LINGUIST List Gateway (Advanced Search for Linguistic Resources). <http://cf.linguistlist.org/cfdocs/new-website/LL-WorkingDirs/olac/olac-search-advanced.cfm>.
[29] Google Web API. <http://www.google.com/apis>.
[30] Jakob Nielsen, 1999. Designing Web Usability. New Riders Publishing.

Copyright © 2005 Baden Hughes and Amol Kamat

doi:10.1045/february2005-hughes