D-Lib Magazine
January/February 2015
Semantic Enrichment and Search: A Case Study on Environmental Science Literature
Kalina Bontcheva, University of Sheffield, UK
Abstract
As information discovery needs become more and more challenging, traditional keyword-based information retrieval methods are increasingly falling short in providing adequate support. The problem is often compounded by the poor quality of article metadata in some digital collections. This paper investigates automatic semantic enrichment and search methods as ways to meet these challenges. In particular, the benefits of enriching articles with knowledge from Linked Open Data resources are investigated, with a focus on the domain of environmental science. To help environmental science researchers carry out better semantic searches, a form-based semantic search interface is proposed. It helps researchers to benefit from the semantically enriched content, e.g. to carry out sophisticated location-based searches. The usability and ease of learning of this web interface were evaluated in a user-based study, the results of which are also reported.
1 Introduction
Environmental Science is a broad, interdisciplinary subject area that spans biology, chemistry, earth sciences, physics, and engineering. Because of this breadth of scope, information discovery and sharing in environmental science is often a challenge: traditional keyword-based, full-text search cannot address the more complex information seeking requirements, which include sense-making and exploratory search (Pirolli, 2009). In these cases, traditional precision-oriented approaches from the field of Information Retrieval (IR) are not sufficient. For exploratory search, in particular, recall is paramount, as is the ability to carry out interactive retrieval (Pirolli, 2009).
Linked Open Data (LOD), when coupled with semantic enrichment and search methods, offers an opportunity to improve the process of information discovery through enriching and contextualizing scientific publications with respect to unique, machine-readable, interlinked open vocabularies. In particular, semantic search over documents aims to address these challenges by finding information based not just on the presence of words, but also on their meaning (Kiryakov, et al., 2004). Relevant LOD vocabularies for environmental science are already becoming available (e.g. the GEMET thesaurus), as are other key resources relevant for the domain (e.g. GeoNames, DBpedia).
Manual enrichment of article metadata and textual content with knowledge from LOD resources, however, is prohibitively expensive and unsustainable, since LOD vocabularies typically have millions of entries. Therefore, automatic LOD-based semantic annotation methods were used to enrich the full-text content with disambiguated domain terms and entities (e.g. locations, organisations, persons), described through Uniform Resource Identifiers (URIs). In addition, the original articles are enriched with relevant knowledge from the respective LOD resources (e.g. that Oxford is part of England). This is needed in order to answer queries that require common-sense knowledge, which is often not present in the original article content. For example, following semantic enrichment, a semantic search for documents on flooding in England will now be able to retrieve a relevant document about floods in Oxford, even though the original text does not explicitly mention England.
Designing semantic search interfaces that are easy to use and learn, however, is both a key requirement and a major challenge (Bast, et al., 2013). The interface needs to be not only more powerful than traditional full-text search over publications, but also simple enough for non-expert users.
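The Oxford and England example above can be illustrated with a minimal Python sketch. It is not the system described in this paper: the `PART_OF` table, the document records, and the function names are all hypothetical stand-ins for the containment facts that a LOD resource such as GeoNames or DBpedia would supply.

```python
# Toy "part of" facts of the kind a LOD resource would provide.
# All place relations and documents below are illustrative assumptions.
PART_OF = {
    "Oxford": "Oxfordshire",
    "Oxfordshire": "England",
    "Cambridge": "Cambridgeshire",
    "Cambridgeshire": "England",
    "Aberdeen": "Scotland",
}


def containing_regions(place):
    """Return the place itself followed by every region that contains it."""
    regions = [place]
    while place in PART_OF:
        place = PART_OF[place]
        regions.append(place)
    return regions


def enrich(doc):
    """Attach the expanded set of locations to a document record."""
    expanded = set()
    for place in doc["locations"]:
        expanded.update(containing_regions(place))
    return {**doc, "enriched_locations": expanded}


def search(docs, topic, region):
    """Return ids of documents on `topic` located in `region`,
    either directly or through the part-of knowledge added by `enrich`."""
    return [
        d["id"]
        for d in docs
        if topic in d["topics"] and region in d["enriched_locations"]
    ]


docs = [
    enrich(d)
    for d in [
        {"id": "doc1", "topics": {"flooding"}, "locations": ["Oxford"]},
        {"id": "doc2", "topics": {"flooding"}, "locations": ["Aberdeen"]},
        {"id": "doc3", "topics": {"drought"}, "locations": ["Cambridge"]},
    ]
]

# doc1 mentions only Oxford, yet it matches a search for England.
print(search(docs, "flooding", "England"))
```

Expanding locations at enrichment time, rather than at query time, keeps the search itself a simple set-membership test; a real system would more likely resolve such containment through queries against the knowledge base.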
The novel contributions of this paper are threefold:
2 Background
Within the sphere of environmental science, the area with the greatest legacy of semantic enrichment is that of geospatial information (Janowicz, et al., 2013), with applications including GIS environments/Spatial Data infrastructures (SDI), environmental sensor networks and geotagging (Pilman, et al., 2011). These approaches all identify interdisciplinary datasets, as are commonly found in environmental science, as a particularly fruitful area for LOD exploration. In these contexts, dataset metadata is semantically enriched in order to improve search and enable correct use of data (Schentz, et al., 2011). The LOD GEMET thesaurus underpins the EU INSPIRE directive, which aims to establish a digital infrastructure for spatial information in Europe in order to support environmental research, policy and decision-making. This ties into the Open Data movement and data.gov.uk, which is being used as a vehicle through which the UK might comply with INSPIRE requirements for making environmental data available and discoverable (Shaon, et al., 2011).
Although progress is being made in environmental informatics with respect to enabling the discovery and better use of datasets and geographic information within the GIS/SDI context, LOD vocabularies have not as yet been applied in the context of semantic enrichment of environmental science literature. This contrasts with the biomedical sciences where text mining has been enabled by the Unified Medical Language System, a meta-thesaurus provided by the US National Library of Medicine, which acts as a comprehensive thesaurus and ontology of biomedical concepts (Hettne, et al., 2010).
In more detail, we experimented with existing environmental Linked Data vocabularies, namely GEMET and the Ordnance Survey Hydrology ontologies (Devaraju & Kuhn, 2010), as well as two general purpose LOD resources (DBpedia (Bizer et al., 2009) and GeoNames).
These were used as knowledge sources for automated semantic enrichment of environmental science literature, coupled with a semantic search user interface.
3 LOD-based Semantic Enrichment
Semantic annotation is the process of tying semantic models, such as ontologies, and scientific articles together. It may be characterized as the dynamic semantic enrichment of unstructured and semi-structured documents with new knowledge and linking these to relevant domain ontologies/knowledge bases. It typically requires annotating a potentially ambiguous entity mention (e.g. Cambridge) with the canonical identifier of the correct unique entity (e.g. depending on the document content, http://dbpedia.org/resource/Cambridge or http://dbpedia.org/resource/Cambridge,_Massachusetts).
In our experiments, domain-specific LOD resources, such as the GEMET thesaurus and the Ordnance Survey Hydrology ontology (Devaraju & Kuhn, 2010), were used as a source of relevant terms, with which to enrich the article metadata and also to aid semantic search by providing synonyms. Occurrences of such terms were annotated automatically, using a combination of the GATE open-source English morphological analyser (Cunningham, et al., 2011) to detect the root word forms, and the ontology-based OntoRoot gazetteer, which matches terms using their labels in the thesauri (Cunningham, et al., 2011).
Since some of the most frequently used searches are for persons, locations, organisations, and other named entities (Pound, et al., 2010), we also used YODIE (Damljanovic & Bontcheva, 2012) to identify such entities mentioned in the article full-text and disambiguate them to DBpedia URIs. YODIE uses a combination of four classes of similarity metrics: string similarity, semantic similarity between nearby entities, contextual similarity between the document and the textual abstract of the candidate URI in DBpedia, and URI commonness as anchor text in Wikipedia articles.
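To make the idea of combining several similarity metrics concrete, the following Python sketch ranks candidate URIs for a single mention. The text above does not state how YODIE combines its four metric classes, so the equal-weight sum, the `Candidate` class, the function names, and all score values here are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One candidate URI for a mention, with one score per metric class.
    All scores are assumed to be normalised to the range 0..1."""
    uri: str
    string_sim: float    # similarity between the mention and the URI label
    semantic_sim: float  # relatedness to other entities mentioned nearby
    context_sim: float   # document text vs. the DBpedia abstract
    commonness: float    # how often the URI is the target of this anchor text


# Hypothetical equal weights; the actual combination is not given in the text.
WEIGHTS = {
    "string_sim": 0.25,
    "semantic_sim": 0.25,
    "context_sim": 0.25,
    "commonness": 0.25,
}


def score(candidate, weights=WEIGHTS):
    """Weighted sum of the four metric scores."""
    return sum(w * getattr(candidate, name) for name, w in weights.items())


def disambiguate(candidates, weights=WEIGHTS):
    """Return the URI of the highest-scoring candidate."""
    return max(candidates, key=lambda c: score(c, weights)).uri


# Made-up scores for the mention "Cambridge" in a report on UK flooding.
candidates = [
    Candidate("http://dbpedia.org/resource/Cambridge", 1.0, 0.8, 0.7, 0.6),
    Candidate("http://dbpedia.org/resource/Cambridge,_Massachusetts",
              0.6, 0.2, 0.1, 0.3),
]

print(disambiguate(candidates))
```

A weighted sum is only one plausible choice; a learned classifier over the same four features would fit the description equally well.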
YODIE uses the GATE tokeniser, POS tagger, and the ANNIE NER system (Cunningham, et al., 2013) for linguistic pre-processing and entity recognition, respectively. It also uses the open-source GATE Large Knowledge Base (LKB) Gazetteer (Cunningham, et al., 2011), which assigns candidate DBpedia URIs to entities mentioned in the text.
The result of the semantic annotation and entity disambiguation algorithm is a set of full-text articles enriched with URIs, one URI per term or named entity mentioned. Once the URIs are added as annotations, the article texts are enriched with additional semantic knowledge from the respective LOD resource. For instance, if a document mentions Cambridge, once it is disambiguated to http://dbpedia.org/resource/Cambridge (i.e., the English university city), a new annotation will be added to the text, containing the latitude and longitude, country, county, and population information, as given in DBpedia. This additional knowledge enables, inter alia, better location-based searches. For example, a user searching for publications on flooding in East Anglia will now be able to find a report about flooding in Cambridge.
In total, 10,000 environmental science documents and associated metadata were enriched automatically with term and entity URIs from DBpedia, GeoNames, GEMET, and the Ordnance Survey ontology, as well as with linguistic information, such as part of speech.
4 Impact of Semantic Enrichment and Search on Information Discovery
In order to scope requirements for the semantic search tool, it was important to understand the needs and search behaviour of its potential users. Users were contacted via personal contacts and environmental science networks. A total of 34 respondents answered, who could be grouped into Local Authority, Consultancy, Academia, NGO/Charity, Government Agency and SME (business).
Responses are slightly skewed towards local authorities, because the survey was posted on the FlowNet website, which provides resources and a point of interaction for that group. The results of the user requirements scoping are detailed in Kieniewicz & Wallis (2013). Based on the survey results, environmental science researchers from within The British Library and HR Wallingford carried out information discovery searches on the semantically enriched metadata records and full-text documents. The purpose of this small-scale user assessment was to gain insight into how semantic enrichment and semantic search can improve information discovery. In particular, we examined:
4.1 Impact of Semantic Enrichment on Article Metadata
The automatically added LOD-based semantic annotations were manually checked in each of the documents, to assess their accuracy and relevance to the types of searches requested by the environmental science researchers in our survey (Kieniewicz & Wallis, 2013). The focus was on enhancing the article metadata by populating the Dublin Core (Weibel, et al., 1998) Subject field. The benefit of semantic enrichment in this case is that, by surfacing annotated terms derived from the full-text content, concepts buried within the body of the paper/report can be highlighted. The addition of terms affects the relevance ranking in full-text searches. Moreover, searches can be made more specific by limiting the search criteria to the Subject field (e.g. through faceted search). This is similar in principle to the use of Medical Subject Headings (MeSH) (NLM, 1960) within the Medline and PubMed databases, where the content of the original document is described through the use of key terms added to the bibliographic record.
For each semantically annotated full-text document, the metadata enrichment algorithm retained the top five locations and organisations with DBpedia entity URIs and the corresponding location-related knowledge. Domain-specific terms were also added to the metadata, on the basis of the environmental science ontologies. This automatically acquired metadata was incorporated into the Subject fields of the document (see the highlighted terms at the bottom of Figure 1).
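The metadata enrichment step described above can be sketched in Python as follows. The text does not say how the "top five" entities were ranked, so this sketch assumes ranking by number of mentions. The annotation format, the record layout, and the function names are likewise hypothetical.

```python
from collections import Counter


def top_entities(annotations, entity_type, k=5):
    """Return the k most frequently mentioned URIs of the given type.
    Ranking by mention count is an assumption, not the paper's stated method."""
    counts = Counter(a["uri"] for a in annotations if a["type"] == entity_type)
    return [uri for uri, _ in counts.most_common(k)]


def enrich_subject(record, annotations, k=5):
    """Return a copy of the record whose Subject field also lists the
    top-k location and organisation URIs found in the full text."""
    subjects = list(record.get("subject", []))
    for entity_type in ("Location", "Organisation"):
        for uri in top_entities(annotations, entity_type, k):
            if uri not in subjects:
                subjects.append(uri)
    return {**record, "subject": subjects}


# Illustrative annotations for one document (made-up counts).
CAMBRIDGE = "http://dbpedia.org/resource/Cambridge"
OXFORD = "http://dbpedia.org/resource/Oxford"
AGENCY = "http://dbpedia.org/resource/Environment_Agency"

annotations = (
    [{"type": "Location", "uri": CAMBRIDGE}] * 3
    + [{"type": "Location", "uri": OXFORD}]
    + [{"type": "Organisation", "uri": AGENCY}] * 2
)

record = {"title": "Flood risk report", "subject": ["flooding"]}
enriched = enrich_subject(record, annotations, k=1)
print(enriched["subject"])
```

Returning a new record rather than modifying the original keeps the source metadata intact, which makes it easy to compare records before and after enrichment.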