D-Lib Magazine
November/December 2015

Structured Affiliations Extraction from Scientific Literature

Dominika Tkaczyk, Bartosz Tarnawski and Łukasz Bolikowski

Abstract

CERMINE is a comprehensive open-source system for extracting structured metadata from scientific articles in born-digital form. Among other information, CERMINE is able to extract the authors and affiliations of a given publication, establish relations between them and present the extracted metadata in a structured, machine-readable form. Affiliation extraction is based on a modular workflow and utilizes supervised machine learning as well as heuristic-based techniques. According to the evaluation we performed, the algorithm achieved good results both in the affiliation extraction (84.3% F1) and affiliation parsing (92.1% accuracy) tasks. In this paper we outline the overall affiliation extraction workflow and provide details about the implementation of the individual steps. We also compare our approach to similar solutions, thoroughly describe the evaluation methodology and report its results. The CERMINE system, including the entire affiliation extraction and parsing functionality, is available under an open-source licence.

Keywords: Metadata Extraction, Scientific Literature Analysis, Content Classification, Affiliations Extraction, Affiliations Parsing

1 Introduction

Academic literature is a very important communication channel in the scientific world. Keeping track of the latest scientific findings and achievements, usually published in journals or conference proceedings, is a crucial aspect of typical research work. Ignoring this can result in deficiencies in the knowledge related to the latest discoveries and trends, which in turn can lower the quality of the research, make the assessment of results much harder and significantly limit the chances of finding new and interesting research challenges. Unfortunately, studying scientific literature, and in particular keeping up to date with the latest publications, is difficult and extremely time-consuming. The main reason for this is the very large and constantly growing volume of scientific resources, as well as the fact that publications are predominantly made available as unstructured text.

Modern digital libraries support the process of studying the literature by storing and presenting document collections not only in the form of the documents' raw sources, but accompanied by information related to citations, authors, organizations, research centres, projects, datasets, and also relations of various kinds. Machine-readable data describing scholarly communication makes it possible to build useful tools for intelligent search, detecting similar and related documents and authors, assessing the achievements of individual authors and entire organizations, identifying people and teams with a given research profile, and much more. In order to provide such high-quality services a digital library requires access to a rich set of metadata for the stored documents, including information such as authors and their institutions or countries. Unfortunately, in practice good-quality metadata is not always available; it is sometimes missing, fragmentary or full of errors. In such cases the library needs an automatic method of extracting the metadata from the documents at hand.

In this paper we propose a reliable method for extracting structured affiliations from scientific articles in PDF format. The algorithm is able to extract the authors and affiliations of a publication, associate authors with their affiliations, and parse affiliation strings in order to identify institution, address and country information.
The method is based on a modular workflow, which makes it easy to improve or replace the implementation of one task without changing other parts of the solution. We make extensive use of supervised machine-learning techniques, which increases the maintainability of the solution, as well as its ability to adapt to new document layouts. The affiliation extraction and parsing functionality described in this article is part of CERMINE [14], our open-source tool for automatic metadata and bibliography extraction from born-digital scientific literature. The CERMINE web service, as well as the source code, can be accessed online [12].

In the following sections we describe the state of the art (Section 2), provide the details of the overall workflow architecture and the individual implementations (Section 3) and finally report the evaluation methodology and its results (Section 4).

2 State of the Art

Even when limited to analysing scientific literature only, the problem of extracting a document's authors and affiliations remains difficult and challenging, mainly due to the vast diversity of possible layouts and styles used in articles. In different documents the same type of information can be displayed in different places using a variety of formatting styles and fonts. For instance, a random subset of 125,000 documents from PubMed Central contains publications from nearly 500 different publishers, many of which use original layouts and styles in their articles. What is more, the PDF format, which is the most popular format for storing source documents, does not preserve information related to the document's structure, such as words and paragraphs, lists and enumerations, or the reading order of the text. This information has to be reverse engineered from the text content and the way the text is displayed in the source file.

Nevertheless, there exist a number of methods and tools for extracting metadata from scientific literature. They differ in availability, licenses, the scope of extracted information, the algorithms and approaches used, input formats and performance.

Han et al. [5] describe a machine learning-based approach for extracting metadata, including authors and affiliations, from the headers of scientific papers. The proposed method is designed for analysing textual documents and employs a two-stage classification of text lines with the use of Support Vector Machines and text-related features.

PDFX [2] is a rule-based system for converting scholarly articles in PDF format to an XML representation by annotating fragments of the input documents. PDFX extracts basic metadata and structured full text, but unfortunately omits affiliation-related information. What is more, the system is closed source and available only as a web service.

GROBID [8] is a machine learning-based library focusing on analysing scientific texts in PDF format. GROBID uses Conditional Random Fields in order to extract a document's metadata (including parsed affiliations), full text and parsed bibliographic references. The library is open source.

SectLabel, a module from the ParsCit project [9, 3], also uses Conditional Random Fields to extract the logical structure of scientific articles, including the document's metadata, structured full text and parsed bibliography. The tool is able to extract affiliations, but does not parse them. ParsCit analyses documents in text format, and therefore does not use the geometric hints present in PDF files. The tool is available as open source.
Extracting and parsing affiliations is also provided by the NEMO system [6]. NEMO does not process PDF files directly, but rather extracts information from XML. What is more, affiliation parsing is done using regular expressions and dictionaries, which limits the possibility of adapting the solution to new formats and styles.

Enlil [4] is another tool able to extract authors, affiliations and the relations between them from scientific publications. The system uses SectLabel [9] to detect authors and affiliations in the article's text, while splitting the author and affiliation lists, and associating authors with affiliations, are done by CRF and SVM classifiers. Unfortunately, Enlil does not perform affiliation parsing.

In terms of functionality, input format and methods used, our solution is most similar to GROBID. CERMINE extracts affiliations directly from PDF files, assigns them to the extracted authors and also performs affiliation parsing. Extraction and parsing are done primarily by machine learning-based algorithms, which makes the system more adaptable to new layouts and styles. The web service, as well as the source code, are publicly available. We compare our solution to other tools with respect to author and affiliation extraction performance (see Section 4).

3 Author and Affiliation Extraction Workflow

The affiliation extraction algorithm accepts a scientific publication in PDF format as input and extracts authors and affiliations in structured form. The following is an example fragment of the algorithm's output in NLM JATS format:
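The fragment below is a minimal reconstruction of such output; the author name, affiliation text and identifiers are placeholders rather than values taken from a real article.

```xml
<contrib-group>
  <contrib contrib-type="author">
    <name>
      <surname>Kowalska</surname>
      <given-names>Anna</given-names>
    </name>
    <xref ref-type="aff" rid="aff1"/>
  </contrib>
  <aff id="aff1">
    <label>1</label>
    <institution>Institute of Informatics, Example University</institution>,
    <addr-line>Example City</addr-line>,
    <country country="PL">Poland</country>
  </aff>
</contrib-group>
```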
The extraction workflow is composed of four steps, shown in Figure 1 and described in the following subsections: document structure extraction, content classification, author and affiliation metadata extraction, and affiliation parsing.

Table 1 shows the decomposition of the extraction workflow into individual steps and provides basic information about the tools and algorithms used for each step.
Table 1: The decomposition of the extraction workflow into individual steps.

3.1 Document structure extraction

Structure extraction is the initial phase of the entire workflow. Its goal is to create a hierarchical structure of the document preserving the entire text content of the input document, as well as features related to the way the text is displayed in the PDF file. More information about the implementation details can be found in [14]. Basic structure extraction is composed of three steps: character extraction, page segmentation and reading order resolving.
The purpose of character extraction is to extract individual characters from the PDF stream along with their positions on the page and dimensions. The implementation is based on the open-source iText library. We use iText to iterate over a PDF's text-showing operators. During the iteration we extract text strings along with their size and position on the page. Next, the extracted strings are split into individual characters and their individual widths and positions are calculated. The result is an initial flat structure of the document, which consists only of pages and characters. The widths and heights computed for individual characters are approximate and can differ slightly from the exact values, depending on the font, style and characters used. Fortunately, those approximate values are sufficient for the further steps.

The goal of page segmentation is to create a geometric hierarchical structure storing the document's content. As a result the document is represented by a list of pages, each page contains a set of zones, each zone contains a set of text lines, each line contains a set of words, and finally each word contains a set of individual characters. Each object in the structure has its content, position and dimensions. Page segmentation is implemented with the use of the bottom-up Docstrum algorithm [11], which analyses the nearest neighbours of individual characters in order to group characters into text lines, and then text lines into zones.
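To give the flavour of this bottom-up grouping, the sketch below joins characters into text lines by linking each character to nearby characters that lie roughly on the same horizontal line; the distance and angle thresholds are illustrative values, not the ones used by CERMINE.

```python
import math

def group_into_lines(chars, max_dist=20.0, max_angle=0.2):
    """Greedy union-find grouping of characters into text lines: two characters are
    linked when they are close and the vector between them is nearly horizontal."""
    parent = list(range(len(chars)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i, a in enumerate(chars):
        for j, b in enumerate(chars[i + 1:], start=i + 1):
            dx, dy = b["x"] - a["x"], b["y"] - a["y"]
            angle = abs(math.atan2(dy, dx))
            nearly_horizontal = angle <= max_angle or angle >= math.pi - max_angle
            if math.hypot(dx, dy) <= max_dist and nearly_horizontal:
                union(i, j)

    lines = {}
    for i, c in enumerate(chars):
        lines.setdefault(find(i), []).append(c)
    # sort characters within each line from left to right
    return [sorted(line, key=lambda c: c["x"]) for line in lines.values()]

# toy input: two short lines of text
chars = [{"c": ch, "x": 10 + 8 * k, "y": 100} for k, ch in enumerate("Affiliations")] + \
        [{"c": ch, "x": 10 + 8 * k, "y": 130} for k, ch in enumerate("Extraction")]
for line in group_into_lines(chars):
    print("".join(c["c"] for c in line))
```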
A PDF file contains, by design, a stream of strings that undergoes the extraction and segmentation process. As a result we obtain pages containing characters grouped into zones, lines and words, all of which form what could be described as an "unsorted bag of items". The aim of setting the reading order is to determine the right sequence in which all the structure elements should be read. An example document page with the reading order of its zones is shown in Figure 3.

The reading order resolving algorithm is based on a bottom-up strategy: first, characters are sorted horizontally within words and words within lines, then lines are sorted vertically within zones, and finally the zones themselves are sorted. The fundamental principle for sorting zones was taken from PdfMiner. We make use of the observation that in most modern languages the natural reading order descends from top to bottom if successive zones are aligned vertically, and otherwise traverses from left to right. This observation is reflected in the distances computed for all pairs of zones: the distance is calculated using the angle of the slope of the vector connecting the zones, so that zones aligned vertically are in general closer than those aligned horizontally. Then, using an algorithm similar to hierarchical clustering methods, we build a binary tree by repeatedly joining the closest zones and groups of zones. After that, the children of every node are swapped, if needed. Finally, an in-order traversal of the tree gives the desired order of the zones (a simplified sketch is shown below).
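The sketch below illustrates this idea; the angle-based distance weighting and the rule used to order the children of each merged pair are simplifications assumed for the example, not CERMINE's exact formulas.

```python
import math

def _flat(node):
    """Collect the leaf zones under a node (a zone dict or a (left, right) pair)."""
    return [node] if isinstance(node, dict) else _flat(node[0]) + _flat(node[1])

def _centre(node):
    zones = _flat(node)
    return (sum(z["x"] for z in zones) / len(zones),
            sum(z["y"] for z in zones) / len(zones))

def _distance(a, b):
    """Distance between two groups, weighted by the slope of the connecting vector so
    that vertically aligned groups come out closer than horizontally aligned ones."""
    (ax, ay), (bx, by) = _centre(a), _centre(b)
    dx, dy = bx - ax, by - ay
    angle = math.atan2(abs(dx), abs(dy))      # 0 = vertical alignment, pi/2 = horizontal
    return math.hypot(dx, dy) * (1.0 + angle)

def reading_order(zones):
    """Repeatedly merge the closest groups of zones into a binary tree, ordering the
    children of each merge top-to-bottom (then left-to-right); reading the leaves of
    the final tree in order gives the reading order."""
    groups = list(zones)
    while len(groups) > 1:
        pairs = [(i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))]
        i, j = min(pairs, key=lambda p: _distance(groups[p[0]], groups[p[1]]))
        first, second = sorted((groups[i], groups[j]),
                               key=lambda g: (_centre(g)[1], _centre(g)[0]))
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [(first, second)]
    return _flat(groups[0])

# toy page with a full-width title above two columns (y grows downwards)
page = [{"id": "title", "x": 300, "y": 60},
        {"id": "left-1", "x": 150, "y": 200}, {"id": "left-2", "x": 150, "y": 420},
        {"id": "right-1", "x": 450, "y": 200}, {"id": "right-2", "x": 450, "y": 420}]
print([z["id"] for z in reading_order(page)])
# -> ['title', 'left-1', 'left-2', 'right-1', 'right-2']
```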
3.2 Content classification

The goal of content classification is to label each zone with a functional class. The classification is done in two stages: initial classification assigns general categories (metadata, references, body, other), while the goal of metadata classification is to classify all metadata zones into specific metadata classes (abstract, bib info, type, title, affiliation, author, correspondence, dates, editor, keywords). Both classifiers use Support Vector Machines and their implementation is based on the LibSVM library [1]. The classifiers differ in their SVM parameters, but in both cases the best parameters were found by performing a grid search using a set of 100 documents from the PubMed Central Open Access Subset and maximizing the mean F-score obtained during a 5-fold cross-validation.
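As an illustration of this tuning step, the sketch below uses scikit-learn (not the LibSVM library used by CERMINE) and synthetic feature vectors in place of real zone features; the parameter grid is an assumption made for the example.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# Stand-in for zone feature vectors and their labels (0 = metadata, 1 = references,
# 2 = body, 3 = other); in CERMINE these come from the GROTOAP2 training data.
X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)

# Grid search over SVM parameters, selecting the combination that maximizes the
# mean macro F-score in a 5-fold cross-validation.
search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [1, 10, 100, 1000], "gamma": [0.01, 0.001, 0.0001]},
    scoring="f1_macro",
    cv=5,
)
search.fit(X, y)
print("best parameters:", search.best_params_)

# The fitted classifier can now label new zones with general categories; a second,
# analogous classifier refines the "metadata" zones into specific metadata classes.
initial_classifier = search.best_estimator_
print(initial_classifier.predict(X[:5]))
```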
In order to perform zone classification, each zone is transformed into a vector of feature values, which are to a great extent the same for both classifiers. The initial and metadata classifiers use 83 and 62 features, respectively. The features capture various aspects of the content and surroundings of the zones, related among other things to their geometric properties, formatting and lexical content.

3.3 Author and affiliation metadata extraction

As a result of classifying the document's fragments, we usually obtain a few regions labelled as author or affiliation. In this step we extract individual author names and affiliations and determine the relations between them. In general the implementation is based on heuristics and regular expressions, but the details depend on the article's layout. There are two main styles used in different layouts: (1) the names of all authors are placed in one zone in the form of a list, and similarly the affiliations are listed together below the authors' list, at the bottom of the first page or at the end of the document, before the bibliography section (an example is shown in Figure 4), and (2) each author name is placed in a separate zone along with its affiliation and email address (an example is shown in Figure 5).
At the beginning, the algorithm recognizes the type of layout of a given document: if the document contains two zones labelled as affiliation placed horizontally next to each other, it is treated as type (2), otherwise as type (1). In the case of a layout of the first type (Figure 4), the authors' lists are first split using predefined lists of separators. Then we detect affiliation indexes based on predefined lists of symbols as well as geometric features. The detected indexes are then used to split the affiliation lists and assign affiliations to authors. In the case of a layout of the second type (Figure 5), each author is already assigned to its affiliation by being placed in the same zone. It is therefore enough to split the content of such a zone into the author name, affiliation and email address: we assume the first line in the zone is the author name, the email address is detected based on regular expressions, and the rest is treated as the affiliation string (a sketch of this splitting is shown below). In the future we plan to implement this step using a supervised token or line classifier.
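A minimal sketch of this zone-splitting heuristic, assuming the zone's text is available line by line; the e-mail regular expression is an illustrative pattern, not the one used in CERMINE.

```python
import re

# illustrative e-mail pattern; CERMINE's actual expression may differ
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def split_author_zone(zone_text):
    """Split a type (2) zone into author name, affiliation and e-mail address:
    the first line is taken as the author name, the e-mail is matched by a regular
    expression, and whatever remains is treated as the affiliation string."""
    lines = [line.strip() for line in zone_text.splitlines() if line.strip()]
    author, rest = lines[0], " ".join(lines[1:])
    match = EMAIL_RE.search(rest)
    email = match.group(0) if match else None
    affiliation = (rest[:match.start()] + rest[match.end():]).strip(" ,;") if match else rest
    return {"author": author, "affiliation": affiliation, "email": email}

zone = """John A. Smith
Department of Computer Science, Example University,
Example City, Poland
j.smith@example.edu"""
print(split_author_zone(zone))
```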
3.4 Affiliation parsing

The goal of affiliation parsing is to recognize the fragments of an affiliation string related to institution, address and country. Additionally, country names are decorated with their ISO codes. Figure 6 shows an example of a parsed affiliation string.

Figure 6: An example of a parsed affiliation string. Colors mark fragments related to institution, address and country.

The affiliation parser uses a Conditional Random Fields classifier and is built on top of the GRMM and MALLET packages [10]. First, the affiliation string is tokenized, then each token is classified as institution, address, country or other, and finally neighbouring tokens with the same label are concatenated. The token classifier uses a number of features based on the tokens' text content.

The class of a given token depends not only on its own features and the classes of the surrounding tokens, but also on the surrounding tokens' features. To reflect this in the classifier, a token's feature vector contains not only the features of the token itself, but also the features of the two preceding and the two following tokens (a sketch of this idea is shown below).
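The sketch below shows how such context-window feature vectors can be assembled for a tokenized affiliation string; the individual features used (lowercased word, digit and capitalization flags, membership in a small country list) are illustrative assumptions, not CERMINE's actual feature set. Feature vectors of this kind are then handed to the CRF, which jointly predicts the label sequence.

```python
COUNTRIES = {"poland", "germany", "france"}   # illustrative dictionary

def token_features(token):
    """Features of a single token (illustrative set)."""
    return {
        "word": token.lower(),
        "is_digit": token.isdigit(),
        "is_upper_initial": token[:1].isupper(),
        "in_country_dict": token.lower() in COUNTRIES,
    }

def sequence_features(tokens, window=2):
    """For each token, merge its own features with those of the `window` preceding
    and following tokens, prefixing feature names with the relative offset."""
    vectors = []
    for i in range(len(tokens)):
        vec = {}
        for offset in range(-window, window + 1):
            j = i + offset
            if 0 <= j < len(tokens):
                for name, value in token_features(tokens[j]).items():
                    vec[f"{offset:+d}:{name}"] = value
        vectors.append(vec)
    return vectors

tokens = "Institute of Informatics , University of Warsaw , Warsaw , Poland".split()
for tok, vec in zip(tokens, sequence_features(tokens)):
    print(tok, vec["+0:word"], vec["+0:in_country_dict"])
```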
4 Evaluation

We evaluated the key steps of the algorithm as well as the entire extraction process. The ground truth data used for the evaluation is based on the resources of the PubMed Central Open Access Subset.

4.1 Datasets preparation

Three datasets were used during the evaluation process. A subset of PMC was used directly to evaluate the affiliation extraction workflow. Additionally, PMC served as the base for constructing the GROTOAP2 dataset, which was used for training and evaluating the zone classifiers. A separate affiliation dataset was used to train and evaluate the affiliation parser.

The PubMed Central Open Access Subset contains life sciences publications in PDF format accompanied by corresponding metadata in the form of NLM XML files. The NLM files contain a rich set of document metadata, structured full text and also the documents' bibliographies. A subset of 1,943 documents from PMC was used directly to evaluate the entire affiliation extraction process.

Unfortunately, NLM files contain only the annotated text of the documents and do not preserve the geometric features related to the way the text is displayed in the PDF files. As a result, PMC could not be used directly for training and evaluating the zone classifiers. For these tasks we built GROTOAP2 [13], a large and diverse dataset containing 13,210 documents from 208 different publishers. The main part of the dataset consists of semi-automatically created ground truth files in TrueViz format [7]. They contain the entire text content of the publications in a hierarchical form composed of pages, zones, lines, words and characters, their reading order and zone labels.

The affiliation dataset used for the parser evaluation contains 8,267 parsed affiliations from PMC documents. Since parsed affiliations in NLM files do not always contain the entire raw affiliation string as it was given in the article, we built the dataset automatically by matching labelled metadata from NLM documents against affiliation strings extracted from the PDFs by CERMINE. The dataset is publicly available here.

4.2 Classification evaluation

Both zone classifiers were evaluated by a 5-fold cross-validation using a set of zones from 2,551 documents from the GROTOAP2 dataset. The set contains 355,779 zones, 68,557 of which are metadata zones. Tables 2 and 3 show the results for the labels of interest: metadata and affiliation.

4.3 Affiliation parsing evaluation

The affiliation parser was evaluated by a 5-fold cross-validation with the use of the 8,267 affiliations from PMC. The confusion matrix for token classification is shown in Table 4.
In addition to evaluating individual token classification, we also checked for how many affiliations entire fragments (institution, address or country) were labelled correctly. The following results were obtained:
4.4 Affiliation extraction workflow evaluation

A subset of 1,943 documents from PMC was used to compare the performance of various extraction systems. We evaluated four tasks:
During the evaluation we used the default version of CERMINE, in which the zone classifiers are trained on a subset of the GROTOAP2 dataset and the affiliation parser is trained on our affiliation dataset. Apart from CERMINE, the following systems were evaluated: GROBID, ParsCit and PDFX. We used the default versions of the tools, without retraining or modifying them in any way. PDFX was executed through its web service. The tools processed articles in PDF format. The only exception was ParsCit, which analyses only the text content of a document; in this case the PDF files were first transformed to text using the pdftotext tool.

The results were obtained by comparing the output provided by the tools with the annotated data from the NLM files. For every document and every task we compared the lists of extracted and ground truth objects. Author and affiliation strings were compared using cosine distance with a threshold, and in the case of author-affiliation relations, a relation was marked as correct if both of its elements matched (a sketch of this matching procedure is shown at the end of this section). This resulted in individual precision and recall values for every document. The overall precision and recall values were computed as the mean values over all documents from the dataset.

The evaluation results are shown in Figure 7. In both the author and the affiliation extraction tasks CERMINE achieved the best F-score (89.2% and 84.3%, respectively). In the case of determining author-affiliation relations we compared our solution only to GROBID, as the other tools do not extract this information. In both tasks CERMINE achieved better results (F1 63.1% and 77.4% in tasks 3 and 4, respectively).
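A minimal sketch of this per-document matching, assuming token-level cosine similarity and an illustrative threshold of 0.7 (the actual threshold is not stated here); each extracted string is greedily matched to at most one ground truth string.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two strings represented as bags of lowercased tokens."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def document_precision_recall(extracted, ground_truth, threshold=0.7):
    """Greedily match extracted strings against ground truth strings and return
    the document-level precision and recall."""
    unmatched = list(ground_truth)
    matches = 0
    for e in extracted:
        best = max(unmatched, key=lambda g: cosine_similarity(e, g), default=None)
        if best is not None and cosine_similarity(e, best) >= threshold:
            matches += 1
            unmatched.remove(best)
    precision = matches / len(extracted) if extracted else 0.0
    recall = matches / len(ground_truth) if ground_truth else 0.0
    return precision, recall

extracted = ["Institute of Informatics, Univ of Warsaw, Poland", "Example University, France"]
truth = ["Institute of Informatics, University of Warsaw, Poland",
         "Example University, Paris, France",
         "Another Institute, Germany"]
print(document_precision_recall(extracted, truth))   # -> (1.0, 0.666...)
```

The overall scores reported in Figure 7 would then be the mean of these per-document values over the whole dataset, as described above.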
5 System Usage

CERMINE's web service is available online and can be accessed here. The source code is available on GitHub here. The system also provides RESTful services that allow the extraction and parsing tasks to be executed by machines, using tools like cURL. Metadata can be extracted from a scientific article in PDF format using the following command:
$ curl -X POST --data-binary @article.pdf \

An affiliation string can be parsed using the following command:

$ curl -X POST \

The time needed to extract metadata from a document depends mainly on its number of pages. Figure 8 shows the processing time as a function of the number of pages for 1,238 random documents. The average processing time for this subset was 9.4 seconds.

Figure 8: Metadata extraction time as a function of a document's number of pages.

Page segmentation and initial zone classification are the most time-consuming steps. By default CERMINE processes the entire input document in order to extract its full text. However, if the client application is interested in the document's metadata only (for example the title, authors and affiliations), it is sufficient to analyse only the first and last pages of the document, as the metadata is rarely present in its middle part. In such cases we recommend restricting the analysis to a fixed number of pages; as a result even large documents can be processed in a reasonable time.

6 Conclusion

This article described an affiliation extraction method for extracting authors and affiliations from scientific publications in PDF format, associating authors with affiliations, and parsing affiliation strings in order to extract institution, address and country information. The algorithm is part of the open-source CERMINE system, a tool for extracting structured metadata and content from born-digital scientific literature. According to our evaluation of the key steps of the algorithm, as well as of the entire extraction process, our solution performed the best in all of the evaluated tasks. The method is effective and reliable and can be used whenever we do not have access to valuable author and affiliation metadata. Our future plans include comparing a heuristic-based approach with machine learning for zone content splitting and for determining author-affiliation relations.

Acknowledgements

This work has been partially supported by the European Commission as part of the FP7 project OpenAIREplus (grant no. 283595).

References

[1] C. Chang and C. Lin. LIBSVM: A library for support vector machines. ACM TIST, 2(3):27, 2011. http://doi.org/10.1145/1961189.1961199

[2] A. Constantin, S. Pettifer, and A. Voronkov. PDFX: fully-automated PDF-to-XML conversion of scientific literature. In ACM Symposium on Document Engineering, pages 177-180, 2013. http://doi.org/10.1145/2494266.2494271

[3] I. G. Councill, C. L. Giles, and M. Kan. ParsCit: an open-source CRF reference string parsing package. In Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May - 1 June 2008, Marrakech, Morocco, 2008.

[4] H. H. N. Do, M. K. Chandrasekaran, P. S. Cho, and M. Y. Kan. Extracting and matching authors and affiliations in scholarly documents. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '13, 2013. http://doi.org/10.1145/2467696.2467703

[5] H. Han, C. L. Giles, E. Manavoglu, H. Zha, Z. Zhang, and E. A. Fox. Automatic document metadata extraction using support vector machines. In ACM/IEEE 2003 Joint Conference on Digital Libraries (JCDL 2003), 27-31 May 2003, Houston, Texas, USA, Proceedings, pages 37-48, 2003. http://doi.org/10.1109/JCDL.2003.1204842

[6] S. Jonnalagadda and P. Topham. NEMO: extraction and normalization of organization names from PubMed affiliation strings. CoRR, abs/1107.5743, 2011.

[7] C. H. Lee and T. Kanungo.
The architecture of TrueViz: a groundTRUth/metadata editing and VIsualiZing ToolKit. Pattern Recognition, 36(3):811-825, 2003. http://doi.org/10.1016/S0031-3203(02)00101-2

[8] P. Lopez. GROBID: combining automatic bibliographic data recognition and term extraction for scholarship publications. In Research and Advanced Technology for Digital Libraries, 13th European Conference, pages 473-474, 2009. http://doi.org/10.1007/978-3-642-04346-8_62

[9] M. Luong, T. D. Nguyen, and M. Kan. Logical structure recovery in scholarly articles with rich document features. IJDLS, 1(4):1-23, 2010. http://doi.org/10.4018/jdls.2010100101

[10] A. K. McCallum. MALLET: A Machine Learning for Language Toolkit. 2002.

[11] L. O'Gorman. The document spectrum for page layout analysis. IEEE Trans. Pattern Anal. Mach. Intell., 15(11):1162-1173, 1993. http://doi.org/10.1109/34.244677

[12] D. Tkaczyk et al. CERMINE: CERMINE 1.6, May 2015. http://doi.org/10.5281/zenodo.17594

[13] D. Tkaczyk, P. Szostek, and L. Bolikowski. GROTOAP2 - the methodology of creating a large ground truth dataset of scientific articles. D-Lib Magazine, 2014. http://doi.org/10.1045/november14-tkaczyk

[14] D. Tkaczyk, P. Szostek, M. Fedoryszak, P. J. Dendek, and L. Bolikowski. CERMINE: automatic extraction of structured metadata from scientific literature. International Journal on Document Analysis and Recognition (IJDAR), pages 1-19, 2015. http://doi.org/10.1007/s10032-015-0249-8

About the Authors