Serving Users in Many Languages

Cross-Language Information Retrieval
for Digital Libraries

Douglas W. Oard
Digital Library Research Group
College of Library and Information Services
University of Maryland, College Park, MD, USA
http://www.glue.umd.edu/~oard
D-Lib Magazine, December 1997

ISSN 1082-9873

We are rapidly constructing an extensive network infrastructure for moving information across national boundaries, but much remains to be done before linguistic barriers can be surmounted as effectively as geographic ones. Users seeking information from a digital library could benefit from the ability to query large collections once using a single language, even when more than one language is present in the collection. If the information they locate is not available in a language that they can read, some form of translation will be needed. At present, multilingual thesauri such as EUROVOC help to address this challenge by facilitating controlled vocabulary search using terms from several languages, and services such as INSPEC produce English abstracts for documents in other languages. On the other hand, support for free text searching across languages is not yet widely deployed, and fully automatic machine translation is presently neither sufficiently fast nor sufficiently accurate to adequately support interactive cross-language information seeking. An active and rapidly growing research community has coalesced around these and other related issues, applying techniques drawn from several fields, notably information retrieval and natural language processing, to provide access to large multilingual collections.

Technical Approaches

A controlled vocabulary information retrieval system can be very useful in the hands of a skilled searcher, but end users often find free text searching to be more helpful. Indexing free text is also potentially more economical because the human effort expended on thesaurus construction and index term assignment can be minimized. Free text retrieval systems typically rely on matching features derived from query terms with features derived from the terms that appear in the document collection. Cross-language free text retrieval thus requires that either the representation of the query or the representation of the document (or both) be translated so that the two representations are compatible. When storage is limited or several languages must be accommodated, translating the query is more practical than translating each document into every language. On the other hand, a strategy based on document translation can permit the translation workload to be performed at indexing time, a useful optimization when every translation will eventually be examined by several users. These alternatives -- and variations on them such as mapping both the queries and the documents into language-independent representations -- present fundamental tradeoffs that designers of cross-language information retrieval systems must consider.
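The query translation alternative described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy English-to-French term list, the function names, and the simple term-counting score are hypothetical and do not describe any particular system discussed in this article.

```python
from collections import Counter

# Hypothetical toy English-to-French term list (an assumption for this
# sketch); a real system would load a much larger bilingual dictionary.
TOY_DICTIONARY = {
    "bank": ["banque", "rive"],
    "interest": ["intérêt"],
    "rate": ["taux"],
}


def translate_query(query, dictionary):
    """Replace each query term with all of its known translations.

    Terms missing from the dictionary are kept unchanged so that proper
    names and shared vocabulary can still match.
    """
    translated = []
    for term in query.lower().split():
        translated.extend(dictionary.get(term, [term]))
    return translated


def score(query_terms, document):
    """Count the document tokens that match any translated query term."""
    counts = Counter(document.lower().split())
    return sum(counts[term] for term in set(query_terms))


def rank(query, documents, dictionary):
    """Return (score, document) pairs, best first, for a translated query."""
    terms = translate_query(query, dictionary)
    scored = [(score(terms, doc), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Because every candidate translation is retained, an ambiguous term such as "bank" also matches documents about riverbanks. This is the translation ambiguity that short queries make difficult to resolve from context alone.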

Document translation essentially reduces cross-language retrieval to its monolingual equivalent, but query translation strategies can impose unique requirements on the five retrieval system components shown in Figure 1. Free text queries posed by end users are often quite short, making it difficult to use context to limit translation ambiguity. Providing facilities for interactive disambiguation in the query formulation interface thus offers the potential to improve retrieval effectiveness (cf. Davis and Ogden 1997). Cross-language matching offers several intriguing possibilities, including automatic query enrichment both before and after translation (Ballesteros and Croft 1997) and cognate matching for out-of-vocabulary words (Buckley et al 1997). When the user is not fluent in the language that a document is written in, interactive selection of promising documents from a list can be facilitated using simple translation techniques that are tuned for the semantics of titles and other displayed metadata (cf. Hayashi et al 1997). Similarly, rapid translation approaches that support interactive browsing on a document-by-document basis will also be needed (cf. Resnik 1997). Finally, delivery of a useful document to the user may require either automated or machine-assisted translation of the entire document (cf. Jordan et al 1993).

Figure 1. Information retrieval system components.
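Cognate matching for out-of-vocabulary words, mentioned above, can be sketched as a search for similarly spelled terms in the vocabulary of the other language. The sketch below is only an illustration and is not necessarily the technique used in the cited work: the function name, the use of the standard library's difflib similarity ratio, and the 0.8 threshold are all assumptions.

```python
from difflib import SequenceMatcher


def cognate_candidates(word, vocabulary, threshold=0.8):
    """Return vocabulary terms whose spelling is close to `word`.

    Similarity is difflib's ratio (2 * matched characters / total length).
    Both the measure and the threshold are assumptions made for this
    sketch rather than values taken from any published system.
    """
    word = word.lower()
    scored = []
    for term in vocabulary:
        ratio = SequenceMatcher(None, word, term.lower()).ratio()
        if ratio >= threshold:
            scored.append((ratio, term))
    # Highest similarity first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored]
```

A query term with no dictionary entry, such as the English word "government", could be passed to this function together with the indexing vocabulary of a French collection. Any close spellings returned could then be added to the translated query in place of the untranslated term.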

User Needs

The technical issues described above have received a good deal of attention in recent years (cf. Oard 1997), but a comprehensive exposition of the user needs that motivate all of this activity has yet to appear. This article is intended to provide a starting point for that discussion by offering some thoughts about what those needs might be. The focus of a user needs assessment process is identification of the users and the tasks that they will seek to accomplish. Table 1 lists some typical cross-language retrieval tasks.

Search a monolingual collection in a language that the user cannot read.
Retrieve information from a multilingual collection using a query in a single language.
Select images from a collection indexed with free text captions in an unfamiliar language.
Locate documents in a multilingual collection of scanned page images.

Table 1. Some types of cross-language information retrieval tasks.

Commercial online services such as Dialog, Lexis/Nexis, and West Publishing provide a large number of high quality monolingual text collections, and similar services exist around the world in other highly developed information economies. Significant markets have developed for retrospective searches that seek existing information and for filtering services that alert clients to new information. In recent years, these services have responded to an increasing demand for end-user searching by expanding the free text search capabilities that they offer. At present, however, the cross-language search capabilities provided by these services are limited to controlled vocabulary searching of the relatively few collections for which a multilingual thesaurus is available. In view of the increasingly interdependent international trade regime, the combination of effective cross-language free text retrieval and responsive creation of suitable translations for the retrieved documents could significantly expand the market for these services.

Another large market with fairly obvious potential is the World Wide Web. Existing search engines index web pages in many languages, and some (e.g., AltaVista) can limit the returned set to web pages in a language selected by the user. A few search engines, such as TITAN and MUNDIAL, offer experimental cross-language search capabilities, but the technology is not yet widely deployed. The limited revenue stream that can be generated from advertising makes present web search services extremely sensitive to marginal costs, so near term penetration of this market will likely require very efficient cross-language retrieval techniques. On the other hand, some components of the cross-language retrieval infrastructure are already in widespread use. For example, inexpensive client-side English-to-Japanese translation software is a popular accessory for web browsers in Japan, and AltaVista now allows users to translate the first few hundred words of a web page into another language.

Individual governments and international institutions also routinely create and search large collections of multilingual information. For example, the European Parliament generates information in all of the official languages of the European Union, and controlled vocabulary indexing is used to provide access to much of this material. The U.S. Foreign Broadcast Information Service (FBIS) illustrates an alternative dissemination approach. FBIS obtains information from news sources around the world and distributes English translations to government and academic users. Although this massive translation approach facilitates monolingual free text retrieval, it could delay dissemination somewhat. Furthermore, cross-language retrieval technology could reduce overall costs if the present translations are sparsely accessed by the intended users.

Some applications for cross-language search technology may not even require that the retrieved object be translated before being presented to the user. Free text annotations such as captions are often used to index non-print media such as photographs, sound recordings and video clips, and large collections of such materials are presently maintained by newspapers and radio and television networks. A substantial commercial market has developed for rapid retrieval of non-print materials from these collections. Typically, the retrieved set provided by non-print media libraries includes both the materials and the captions. Users who are unable to read the captions may, however, be sufficiently satisfied with the materials themselves. Image search engines that index the text associated with a photograph (for example, the text on a web page containing a link to that image) are beginning to appear on the World Wide Web (e.g., WebSeer). An image search engine based on cross-language free text retrieval would be an excellent early application for cross-language retrieval because no translation software would be needed on the user's machine.

Each of the applications described so far exploits the existence of material that is represented digitally as characters. The true needs of real users are likely to extend far beyond such simple representations, however. Retrieval based on scanned page images or spoken language is rapidly becoming practical, and cross-language retrieval will likely be important for those modalities as well. Translation of spoken material is even more challenging than translation of character-coded text, however, and optical character recognition often introduces errors that would confound present fully automatic machine translation systems. As a result, the initial multilingual applications of this technology may focus on retrieval of photographs, video clips and similar objects that are associated with scanned page images or spoken information rather than with the scanned text or spoken information itself.

Worldwide Initiatives

Language translation and monolingual information retrieval technologies have been subjects of long-standing study around the world, and the growing demand for those services will likely foster continued research support. More recently, both the United States and the European Union have begun sponsoring focused research on cross-language information retrieval. The European Multilingual Information Retrieval (EMIR) project, an ESPRIT initiative begun in 1991, extended the French SPIRIT system to perform large-scale free text retrieval between English, French and German. The EMIR project formally ended in 1994, but work on SPIRIT is continuing. European cross-language retrieval research continued with CRISTAL, a European Commission Telematics Application Program (DG XIII/E) project in the Language Engineering sector. CRISTAL investigated retrieval in English, French and Italian between 1993 and 1996. Cross-language retrieval is also a research goal of EuroWordNet, an ongoing project in the Language Engineering sector.

Most of the more recent European projects have focused on specific applications for cross-language retrieval technology. Two Telematics Application Program projects in the Telematics for Libraries sector, TRANSLIB and CANAL/LS, were active between 1995 and 1997. Both projects investigated cross-language searching in library catalogs, and each included English, Spanish and at least one other language. CANAL/LS added German and French, while TRANSLIB added Greek. The MULINEX project, a new Language Engineering initiative, is now investigating the application of similar techniques to web search engines. Another ongoing project is TwentyOne, an Information Engineering sector effort to develop a multilingual system for retrieving documents related to sustainable development. Although not a member of the European Union, Switzerland also supports research in cross-language retrieval at the Swiss Federal Institute of Technology. There has been corporate interest in Europe as well, most notably at the Rank Xerox Research Laboratory in France.

Research in the United States on cross-language free text retrieval has followed a similar pattern, emerging initially as part of several broader research initiatives. Bellcore reported the first experimental investigation of the topic in 1990 using French and English, and former members of the Bellcore group are continuing that work at Duke University and the University of Colorado. New Mexico State University began cross-language retrieval research in 1993, working mostly in English and Spanish, and work there is ongoing. Work on this topic began at the University of Maryland in 1994, and Carnegie Mellon University, the University of Arizona, the University of California at Berkeley, the University of Iowa, and the University of Massachusetts also now have ongoing efforts in this area. These projects have been funded by a number of federal agencies, often as part of a broader research program on information retrieval or natural language processing. Recently, however, the Defense Advanced Research Projects Agency (DARPA) Information Technology Office began the first focused effort in this area, simultaneously initiating three projects to investigate Translingual Information Management.

Fewer reports on cross-language retrieval research by groups in the Pacific Rim region are accessible in the western research literature, but that may result in part from exactly the sort of language barriers that we seek to overcome! In Japan, experimental work has been reported by KDD, NEC and NTT, and there is interest in starting a project at the National Center for Science Information Systems. In Australia, the Royal Melbourne Institute of Technology is investigating cross-language retrieval between English and Vietnamese. Some interest in cross-language retrieval has also been reported in Malaysia and Singapore.

Setting the Research Agenda

Cross-language information retrieval will pose a challenge to information systems throughout the world, so it is only natural to consider the problem from a global perspective. The United States and the European Union together have taken a first step in that direction, jointly sponsoring an international Digital Library working group on Multilingual Information Access that is described elsewhere in this issue (see Klavans and Schauble, Report on the EU-NSF Working Group on Multilingual Information Access). That working group's results, to be summarized in a white paper in the summer of 1998, can help to shape the Fifth Framework research program of the European Union and the second phase of the U.S. Digital Libraries Initiative. National and regional planning efforts will also be needed to focus additional attention on the unique needs of specific user groups. In the United States, for example, DARPA is presently examining the military requirements for cross-language retrieval and is considering additional investments in Translingual Information Management research.

Planning efforts such as those described above provide an opportunity to help focus the efforts of the research community, but the principal wellspring of innovation is the researchers themselves. Cross-language retrieval is an interdisciplinary challenge, drawing on techniques and resources from information retrieval and natural language processing, and this has helped to shape the venues within which the research has been widely reported and discussed. The ACM Special Interest Group on Information Retrieval (SIGIR) sponsored workshops on cross-language retrieval in 1996 and 1997, and papers have appeared at workshops associated with the International Joint Conference on Artificial Intelligence (IJCAI 97), the European Conference on Artificial Intelligence (ECAI 96), and the American Association for Artificial Intelligence (1997 Spring Symposium). It is clearly high time that we bring this work to the worldwide digital library conferences as well.

Until recently, the limited availability of multilingual test collections has made it difficult to evaluate the effectiveness of cross-language retrieval systems and nearly impossible to directly compare the reported effectiveness of techniques developed by different groups. In 1997, however, the U.S. National Institute of Standards and Technology (NIST) and the Swiss Federal Institute of Technology obtained over one hundred thousand documents in each of three languages (English, French and German). NIST then developed relevance judgments for the documents in each language with respect to a common set of 25 topics as part of the Sixth Text REtrieval Conference (TREC-6). Cross-language retrieval drew a remarkable turnout in its first year at TREC, with 12 groups reporting experimental results. A second cross-language information retrieval track is planned for TREC-7 in 1998 -- participation is open to research groups around the world, with proposals due to NIST in early January 1998.
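To suggest how a shared set of topics and relevance judgments makes results from different groups directly comparable, the sketch below scores ranked result lists against judged-relevant documents. This article does not name an effectiveness measure. Uninterpolated average precision is assumed here only because it is a common choice in retrieval evaluation, and the function names and data layout are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Uninterpolated average precision for one topic.

    `ranked_ids` is a system's ranked list of document identifiers and
    `relevant_ids` holds the documents judged relevant for the topic.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position  # precision at this relevant document
    return total / len(relevant)


def mean_average_precision(runs, judgments):
    """Average the per-topic scores over every judged topic.

    `runs` maps a topic identifier to a ranked list, and `judgments` maps
    a topic identifier to its relevant documents (both are assumptions
    about data layout). Topics with no submitted ranking score zero.
    """
    scores = [
        average_precision(runs.get(topic, []), relevant)
        for topic, relevant in judgments.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Two systems that are scored against the same judgments for the same topics yield numbers that can be compared directly, which is not possible when each group evaluates on its own private collection.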

Conclusions

In examining the body of research on cross-language text retrieval, several themes emerge. The most obvious is that the volume of reported research in the field is exploding so rapidly that it will soon become impossible for any individual to keep up with it all. This is good news, for it indicates that the problem is widely regarded as important, interesting and sufficiently tractable to merit a substantial investment. A wide variety of techniques are now known, and the resources needed to construct and evaluate large-scale systems are becoming available.

There are, however, several unmet challenges. The vast majority of the work reported to date has addressed the detection of documents in languages other than that chosen for the query. Much remains to be done with support for query formulation and document selection, although some interesting early work in those areas has been reported by New Mexico State University and NTT. Rapid examination of documents selected by the user that require translation may prove to be an even greater challenge, and timely delivery of high-quality translations also poses some interesting research questions in a networked environment.

The most important open questions may not be technical, however, because the ultimate use of the technology will depend on the needs of the users. Which of the many possible applications for cross-language retrieval technology will predominate in the marketplace? Which will have the greatest social impact? How will the availability of this technology affect other aspects of the Global Information Infrastructure? While it is far too early to answer such questions, it is certainly not too soon to begin to ask them.

For Additional Information

Almost every active cross-language retrieval research group has made information about its work available on the World Wide Web, and a set of links to that work can be found at http://www.clis.umd.edu/dlrg/clir/

References

(Ballesteros and Croft 1997) Lisa Ballesteros and W. Bruce Croft, "Phrasal Translation and Query Expansion Techniques for Cross-Language Information Retrieval," in Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Philadelphia, July 1997, pp. 84-91. Earlier work by the same authors is available at http://ciir.cs.umass.edu/info/psfiles/irpubs/ir.html

(Buckley et al 1997) Chris Buckley, Mandar Mitra, Janet Walz and Claire Cardie, "Using Clustering and SuperConcepts Within SMART: TREC 6," in The Sixth Text REtrieval Conference, D. K. Harman, ed., National Institute of Standards and Technology, to appear. Available soon at http://trec.nist.gov

(Davis and Ogden 1997) Mark W. Davis and William C. Ogden, "QUILT: Implementing a Large-Scale Cross-Language Text Retrieval System," in Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Philadelphia, July 1997, pp. 92-98. Earlier work by the same authors is available at http://crl.nmsu.edu/users/madavis/Site/Book2/book2-toc.html

(Hayashi et al 1997) Yoshihiko Hayashi, Gen'ichiro Kikui and Seiji Susaki, "TITAN: A Cross-Linguistic Search Engine for the WWW," in Cross-Language Text and Speech Retrieval, AAAI Technical Report SS-97-05. Available at http://www.clis.umd.edu/dlrg/filter/sss/papers/

(Jordan et al 1993) Pamela W. Jordan, Bonnie J. Dorr and John W. Benoit, "A First-Pass Approach for Evaluating Machine Translation Systems," Journal of Machine Translation, 8:1-2, pp. 49-58. Available at http://www.umiacs.umd.edu/~bonnie

(Oard 1997) Douglas W. Oard, "Alternative Approaches for Cross-Language Text Retrieval," in Cross-Language Text and Speech Retrieval, AAAI Technical Report SS-97-05. Available at http://www.clis.umd.edu/dlrg/filter/sss/papers/

(Resnik 1997) Philip Resnik, "Evaluating Multilingual Gisting of Web Pages," in Cross-Language Text and Speech Retrieval, AAAI Technical Report SS-97-05. Available at http://www.clis.umd.edu/dlrg/filter/sss/papers/

hdl:cnri.dlib/december97-oard