D-Lib Magazine, March/April 2016

Transforming User Knowledge into Archival Knowledge
Tarvo Kärberg

Abstract

The users of archives have a vast body of knowledge about the records held in the archives. Some users have participated in the events, some have known the persons involved, and some are experts in the subject matter. Their knowledge can fill important gaps in archived knowledge. The objective of this paper is to explore methods to capture user knowledge and transform it into archival knowledge. We investigate the theory and describe the practical implementation developed at the National Archives of Estonia (NAE). In the theoretical part, we discuss the concept of knowledge and analyse its handling in the Open Archival Information System (OAIS) reference model. With a clear focus on practical usability, we identify a way to improve OAIS: complementing the model with a new link between the Access and Data Management functional entities enables more efficient updating of descriptive information. Moving to practice, we describe the initial situation at the NAE, the needs, the approach taken and the solution: the new Archival Information System, AIS 2.0. We explain key aspects of the system, including the use of 5-star Open Data in attaining the goal of knowledge transformation.

Keywords: Digital Preservation, Open Data, Knowledge, Crowdsourcing, OAIS

1 Introduction

The Open Archival Information System (OAIS) defines long-term preservation as the act of maintaining information independently understandable by a designated community (OAIS, 2012, page 1-13). This can be very difficult to achieve in practice, as the information may have been insufficiently described and structured during pre-ingest or ingest for a number of reasons. For example, if the producer organisation no longer existed at the time of archiving, the desired quality level for submission might have been impossible to reach. Another reason could be related to the available resources.
If the producer is unable to devote sufficient resources to a proper transfer, but the archives nevertheless has an interest in (or is obliged to) acquiring the records, then a possible compromise is to transfer the information in less than ideal quality. And indeed, archival organisations have acquired records at highly varying levels of quality. It therefore makes sense to distinguish between three basic terms: data, information and knowledge, as in the DIKW (data, information, knowledge and wisdom) model. Some of the material in archival holdings consists of mere pieces of content: discrete facts without explicit relations (Hicks, Dattero, Galup, 2006, page 19) that can be considered simple data. Some parts of holdings can be seen as information (content which has relations, an aggregation of data) (Vijayakumaran Nair, Vinod Chandra, 2014, page 70), and some parts may even be called (recorded) knowledge if they are interconnected (Quisbert, Korenkova, Hägerfors, 2009, page 14). This distinction may not be fully accepted by all communities of information and archival science, as there is no commonly agreed definition of knowledge (Harorimana, Watkins, 2008, page 853) and it may be argued that knowledge can only exist in the human mind (Eardley, Uden, 2011, page 17). In the context of this paper, however, we follow the spirit of OAIS, which considers it possible to incorporate a knowledge base both in a person and in a system (OAIS, 2012, page 1-12), implying that it is possible to transfer elements of that knowledge base between persons and systems. Another reason to distinguish between these terms is that doing so brings more structure and clarity to understanding the complexity of digital preservation. By moving towards knowledge (i.e.
complementing simple data and information with contextual linking and organisation), we can gain a better overview of the content of archival collections, which in turn allows us to build better (faster, more accurate, user-friendly, personalised, etc.) access solutions and to provide multifaceted access to the archived knowledge. While metadata is crucial for digital projects, its creation can be labour-intensive and time-consuming (Yakel, 2007). As with any metadata, contextual links between records are relatively easy to produce by the creator at the time of creation. The further from creation, the costlier they become to produce, up to the point where (for most of the collections at the archives) it is practically impossible. Archival institutions, especially those with a constant flow of new acquisitions, simply lack the staff to process all of their vast holdings. Another overwhelming challenge is the depth and breadth of expertise required for enriching the descriptions. A good example from Estonia is EÜE, the Estonian Students' Construction Corps (fonds EFA.399 and ERAF.9591 at the National Archives of Estonia). EÜE was an important student organisation during the second half of the Soviet era (1964-1991): it was a medium for the exchange of ideas, a hub of counterculture, a management training camp for future leaders, and a club for forming friendships that led to the creation of political parties and businesses. The collections contain thousands of photos, most of which lack proper descriptions. Archivists have insufficient knowledge to properly describe the people, places and activities depicted in these photos. It is thus reasonable to turn to the users, i.e. to "crowdsource" the knowledge, as the users and archivists together can be more knowledgeable about the archival materials than an archivist alone (Huvila, 2007, page 26).
Crowdsourcing enables description of content to take place at a detailed level of granularity across a broad range of subjects and collections (Eveleigh, 2014, page 212). And the time is ripe for crowdsourcing: the relationship between archives and users is said to have developed from mediation to collaboration (Yakel, 2011, page 257).

2 Archival Software at the National Archives of Estonia

The National Archives of Estonia had an ecosystem of archival software and hardware designed to comply with OAIS. The same system was used for managing analogue and digital records (with the obvious media-specific differences). The catalogue tools were media-agnostic: the processing of archival descriptions for analogue records was done using the tools designed with digital records in mind. The NAE had an electronic archival catalogue, AIS (Archival Information System), with at least the mandatory elements from ISAD(G) filled in for all descriptive units (International Council on Archives, 2000). The archival descriptions typically had the following characteristics:
There were also media-specific catalogue/access systems for photos (FOTIS), video and audio (FIS), maps, etc., which also existed in isolation. The need for more flexible access had been stated by both the archivists and external users. Regarding crowdsourcing, the NAE had some positive experience. The earliest step towards crowdsourcing had been made in December 2004 with the launch of the web user interface of AIS. In AIS, a link for giving feedback was placed on every view, and this has consistently generated about 100 submissions per year, most of them proposals for fixing errors or otherwise improving elements of description. When processing these proposals, the archivist saw the URL of the AIS web interface, but fixing the data required logging in to a separate administrator interface, manually navigating to the record and editing the necessary elements of description. Two more experiences came from specialised crowdsourcing projects: the Name Registry and Digitalgud. The Name Registry is a simple web application for indexing names in church records, plus a form for searching the data collected this way. Digitalgud is a two-week photo tagging campaign that takes place every spring on Facebook and is organised by a group of memory institutions. While all three experiences were positive, there was a clear understanding of the scalability issues. For the AIS feedback form, the issue was the complicated user interface that wasted archivists' time. The Name Registry was free from this limitation, as it was developed with efficiency in mind, but its high development costs were only reasonable for a one-off experiment, not for the frequent creation of new crowdsourcing projects.
In the case of Digitalgud the development cost was negligible, as the whole solution consisted of a simple Facebook page, but consequently the administration costs were high: photos had to be manually uploaded to Facebook, and the user contributions received had to be manually entered into the archival catalogues. In all cases, the key complexity lay in integration with the existing catalogue systems: how to get the records from the catalogues into the crowdsourcing application, and how to integrate the user contributions back into the catalogues. In sum, the NAE had:
A decision was made to develop a new central catalogue (AIS version 2.0) with faceted classification and crowdsourcing designed into the core: a system that would facilitate knowledge transformation2 in all possible ways.

3 User-to-archive Knowledge Transformation in OAIS

To build the theoretical foundation for such knowledge transformation, we need a means to take user input and use it to update archived information. OAIS acknowledges the need to allow users to complement and update the existing information: "It is important that an OAIS's Ingest and internal data models are sufficiently flexible to incorporate these new descriptions so the general user community can benefit from the research efforts." (OAIS, 2012, page 4-53). All descriptive data is handled in the Data Management and Archival Storage functional entities. The Data Management Functional Entity (DMFE) does not strictly provide anything knowledge-specific, but it encompasses the general logic for updating archival holdings. The DMFE is responsible for performing archival database updates. These updates include loading new descriptive information and archiving administrative data, as seen in Figure 1.

Figure 1: Data Management Functional Entity (OAIS, 2012, page 4-10)

The DMFE includes a Receive Database Update function, which allows for adding, modifying or deleting information in Data Management's persistent storage. According to OAIS, "The main sources of updates are Ingest, which provides Descriptive Information for the new AIPs (Archival Information Packages), and Administration, which provides system updates and review updates." (OAIS, 2012, page 4-11). The Administration entity is not relevant to our discussion, as it deals with system-related information (generated by periodic reviewing), not the descriptive information of archival holdings.
When we look at the enrichment possibilities more closely, we notice that the Ingest Functional Entity is expected to coordinate the updates between Data Management and Archival Storage (OAIS, 2012, page 4-53). In practice, however, this may involve several complications:
These reasons can, in some cases, make the Ingest functional entity insufficiently scalable for updating descriptions via crowdsourcing. We acknowledge that the complications are not critical for every use case, but for the needs of the NAE it made sense to look for a more efficient solution. We complemented the OAIS model by adding a direct connection from the Access Functional Entity to the Receive Database Updates function in the DMFE (see the bold dashed arrow in Figure 2).

Figure 2: Updated version of the Data Management Functional Entity (the added pathway is the bold dashed arrow)

The Access entity should act as a gateway in interactions with external users. We select and send the information to crowdsourcing in the Access entity and also receive the enriched information back through the same channel. The enriched information is sent via the direct connection from the Access Functional Entity to the Receive Database Updates service in the DMFE for further actions. This complemented approach to database updates is more intuitive and requires fewer resources than using the full OAIS Ingest workflow. The modification allowed us to entirely bypass the traditional ingest workflows and design the crowdsourcing processes with only the steps that are absolutely necessary.

4 Approach

The NAE designed the following plan for creating the tools and processes to allow external parties to contribute to the descriptions of archived information (Figure 3).

Figure 3: Chosen approach
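As an illustration, the direct pathway from the Access entity to the Receive Database Updates function, introduced in the previous section, can be sketched in a few lines. All names and structures here are hypothetical, for illustration only; they do not reflect the actual AIS 2.0 implementation.

```python
# Sketch of the proposed direct pathway: the Access entity forwards
# user-enriched descriptive metadata straight to the Data Management
# "Receive Database Updates" function, bypassing the Ingest entity.
# All names and structures are illustrative, not the NAE implementation.

def receive_database_update(store: dict, record_id: str, updates: dict) -> dict:
    """Apply field-level updates to a record's descriptive metadata
    (a stand-in for the OAIS Receive Database Updates function)."""
    record = store.setdefault(record_id, {})
    record.update(updates)
    return record

def access_entity_submit(store: dict, record_id: str, contribution: dict) -> dict:
    """The Access entity receives enriched information from a crowdsourcing
    tool and passes it directly to Data Management."""
    # Strip session/system fields; only descriptive elements are passed on.
    descriptive = {k: v for k, v in contribution.items() if not k.startswith("_")}
    return receive_database_update(store, record_id, descriptive)

catalogue = {"EFA.399.1.1": {"title": "EÜE group photo"}}
access_entity_submit(catalogue, "EFA.399.1.1", {"place": "Tartu", "_session": "x1"})
```

The point of the sketch is that the Access entity performs only the steps strictly needed for a descriptive update, instead of invoking the full ingest workflow.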
4.1 Generating Persistent Uniform Resource Identifiers (PURIs)

The NAE decided to create PURIs because they allow unique identification of any resource, independent of the specific information system that holds it. PURIs thereby improve data longevity: they allow the replacement of system components without breaking any references to the information stored in those components. Another benefit of "PURIfied" information is that it can be aggregated automatically. For instance, the former president of Estonia, Lennart Meri, was also a writer and film maker, so there are records about him in the National Archives (the fonds of the Chancellery of the President of Estonia), in the National Library and in the Film Museum. If all of these institutions use standardised PURIs, each of them can provide direct links to the related information in the other institutions. If, in addition to using PURIs, the information is published as linked open data (see the next chapter on machine-readable open data), not only can the links (PURIs) be provided, but the actual data can be pulled from other institutions and displayed organically next to the institution's own records. Similarly, independent third parties can build so-called mash-up services; e.g. a hobbyist historian can create an online database of famous Estonians that gathers all the available information from archives, museums and libraries and presents it in an interesting way. The importance of URIs that are stable in the long term has also been highlighted by the PRELIDA project as one of the burning issues in the digital preservation of linked data (Batsakis, et al., 2014, page 8). In constructing the PURIs we took into account the guidelines and best practices for semantic interoperability (Archer, Goedertier, Loutas, 2012)3 published by the ISA (Interoperability Solutions for European Public Administrations) programme, which were, in brief:
Example: http://{domain}/{type}/{concept}/{reference}
Example: if schools are already assigned integer identifiers, those identifiers should be incorporated into the URI http://education.data.example/id/school/123456
A PURI should refer to a conceptual resource, which in turn can have representations in different formats. Example: the conceptual resource could be identified as http://data.example.org/doc/foo/bar, while its HTML representation could be http://data.example.org/doc/foo/bar.html and its RDF representation http://data.example.org/doc/foo/bar.rdf. The latter two are not PURIs; they are representations that are automatically returned to the user who queries the PURI of the conceptual resource, depending on the type of user (HTML for humans, RDF for machines).
URIs that identify real world objects that cannot be transmitted as a series of bytes (such as places and people) should redirect using HTTP response code 303 to a document that describes the object. Example: http://www.example.com/id/alice_brown could 303-redirect to a page about Alice Brown.
The service that resolves the conceptual PURIs to the URLs of actual information systems should be independent of the data originator. By following these guidelines, we developed PURIs for various content types (Table 1).

Table 1: List of PURIs by resource type
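Taken together, the URI pattern and the 303-redirect rule can be sketched in a few lines. This is a simplified illustration of the ISA guidelines quoted above; the domain name, resolver logic and file extensions are assumptions for the example, not the NAE's actual implementation.

```python
# Sketch of PURI construction and resolution following the ISA pattern
# http://{domain}/{type}/{concept}/{reference}. The domain and the
# /id/ -> /doc/ resolver convention are illustrative assumptions.

def build_puri(domain: str, rtype: str, concept: str, reference: str) -> str:
    """Assemble a PURI from its four components."""
    return f"http://{domain}/{rtype}/{concept}/{reference}"

def resolve(puri: str, accept: str = "text/html"):
    """Resolve a conceptual PURI: real-world objects get an HTTP 303
    redirect to a document describing them; the representation returned
    depends on the requested content type (HTML for humans, RDF for machines)."""
    ext = ".rdf" if accept == "application/rdf+xml" else ".html"
    doc_url = puri.replace("/id/", "/doc/") + ext
    return 303, doc_url  # HTTP 303 See Other, per the guidelines above

puri = build_puri("data.example.org", "id", "school", "123456")
status, location = resolve(puri, accept="application/rdf+xml")
```

A machine client asking for RDF is thus redirected to the `.rdf` representation, while the conceptual PURI itself never changes, even if the underlying information system is replaced.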
For interoperability reasons, all PURIs were agreed to follow these rules:
4.2 Publishing Archival Descriptions in Machine-Readable Form

The NAE chose to publish its restriction-free digital assets as linked open data, i.e. in a machine-readable form and under an open licence. It was a natural decision, since the archival law in Estonia makes all archival holdings openly accessible by default. Only a few per cent of the archival records at the NAE have access restrictions, which stem from data privacy laws and other acts. This open availability covers both the archival descriptions and the digital usage copies (about 15 million files, mostly digitised paper records). The NAE saw open data as an opportunity to better structure and link the archival holdings and as a way to facilitate the transformation of users' knowledge into archival knowledge. The view that linked data is a form of formal knowledge was also highlighted by the PRELIDA project (Batsakis, et al., 2014, page 40). It was decided to observe the five-star categorisation model introduced by Tim Berners-Lee (see Figure 4). The model has five levels, each describing a set of openness characteristics for data published on the Web. The lowest, one-star level of openness describes any data that is available on the Web under an open licence but is neither structured nor easily re-usable. The highest level of five stars is the ultimate in openness: the data is well structured, presented in an open format, described using URIs and linked to other data to provide context.

Figure 4: Open-data star model according to Tim Berners-Lee4

As noted earlier, the NAE decided to present its archival descriptions in RDF (Resource Description Framework) format. Several additional standards (ARCH, BIBO, DCPeriod, Dublin Core, FOAF, LOCAH, MODS, OWL, RDFS, SKOS, VCARD) were also incorporated to improve interoperability with other institutions all over the world. The choice to use RDF was made because it:
All taxonomies (time periods, persons/organisations, subjects, topics, places) were also published as open data. The taxonomies are:
All published descriptions were also uploaded to a public Web portal to better satisfy the needs of open data users. The published archival descriptions include the descriptive information of all archival records, no matter whether they are on paper, on magnetic tape or on some other medium. The descriptions are hierarchical, starting from the fonds level, continuing with series and subseries, and concluding with files and items, as proposed in the ISAD(G) standard (International Council on Archives, 2000). The descriptions available on higher levels are not repeated on lower levels. In addition to RDF, the open data was also published in the apeEAD format, an elaboration of the EAD standard created by the Archives Portal Europe project. The format allows gathering all the descriptive information of a fonds into one XML file. The published archival descriptions are available under the CC0 licence and the digital content under the CC-BY-SA licence.

4.3 Developing Software to Select Records for Crowdsourcing

The most logical place for sending information to a specialised crowdsourcing tool is the central archival catalogue, AIS, as it contains all the archival descriptions that can become subject to enrichment. For that reason, functionality was added to AIS that allows an archivist to select items in search results and put them into a special list for crowdsourcing (see Figure 5).

Figure 5: Selecting items and metadata for crowdsourcing

The list can then be sent to external tools for crowdsourcing. The archival access portal provides access to crowdsourcing solutions via HTTP/RDF or a SOAP (Simple Object Access Protocol) service. While HTTP is meant for human browsing, the RDF and SOAP services are more suitable for software applications.

4.4 Developing Crowdsourcing Tools for End Users

Anna

If the list described above is sent to the crowdsourcing tool Anna5, some additional information should be provided.
The archivist can select the target group (including whether to allow the participation of users who are not logged in), define and explain the scope of the task (select objects and metadata fields for crowdsourcing) and change the status of the task to public or close it, as seen in Figure 6.

Figure 6: Setting up a crowdsourcing task

Anna allows everyone to contribute to the enrichment of archival holdings (Figure 7). Contributing is very simple. The user is first presented with the option to log in (login is not mandatory) and selects a suitable task. There can be a variety of tasks: some ask people to identify the place where a presented photo was taken, others call for the identification of persons in the photos, etc.6 The tasks can also be designed so that the user can select the places or persons from a taxonomy (thereby adding the PURI of the taxonomy item to the metadata of the photo) or, if necessary, add new items to the taxonomy (thereby creating new PURIs). Selection from taxonomies is done by searching and browsing, so the user is mostly spared the complexity of taxonomies and PURIs.

Figure 7: Portal Anna

The tool also has a motivation system, which adds a playful, competitive touch to the crowdsourcing. Leader boards and rankings tend to motivate users to participate more, as people want to advance in the rankings and become more recognised in the archival community. The user score calculated in Anna is not a sum of hours worked or proposals made; it is a sum of points a user earns each time one of their proposals is approved by an archivist.

Ajapaik

Anna is not the only option for a crowdsourcing tool: the same methods of task preparation can be used to work with external tools. The NAE has tested this integration with the mash-up portal Ajapaik. The AIS at the NAE has functionality for selecting photos and forwarding the selected set to Ajapaik (Figure 8).
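The scoring rule described above for Anna (points accrue only when an archivist approves a proposal, never for hours worked or raw submission counts) can be sketched as follows. The status names and the one-point-per-approval value are illustrative assumptions.

```python
from collections import defaultdict

# Sketch of Anna's motivation system: a user earns points only when an
# archivist approves one of their proposals. Pending and rejected
# proposals contribute nothing. Status names and point values are
# illustrative assumptions, not the actual Anna implementation.

def compute_scores(proposals, points_per_approval=1):
    """proposals: iterable of (user, status) pairs, where status is one of
    'approved', 'rejected' or 'pending'. Returns user -> score."""
    scores = defaultdict(int)
    for user, status in proposals:
        if status == "approved":
            scores[user] += points_per_approval
    return dict(scores)

leaderboard = compute_scores([
    ("mari", "approved"), ("mari", "pending"),
    ("jaan", "approved"), ("jaan", "approved"), ("mari", "rejected"),
])
# jaan ranks above mari: two approvals versus one
```

Tying the score to approvals rather than submissions discourages low-quality bulk contributions, since only proposals that survive archivist review count.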
Ajapaik allows adding tags (geo-coordinates, camera position, place names) to the photos. These contributions are aggregated using an algorithm that considers users' trustworthiness (based on the accuracy of their previous contributions) and then averages the geo-coordinates. A photo is considered geotagged when enough trustworthy users tag the photo in approximately the same location. Users are then encouraged to visit the location, take a new photo that matches the angle and composition, and upload it to Ajapaik, so that the old and the new view can be seen side by side. The metadata collected through this process can be sent to the AIS.

Figure 8: Portal Ajapaik

The portal aims to captivate users by allowing them to add geotags and rephotograph objects, giving them the unique opportunity to see what places looked like many years ago and exactly where the photos were taken. The data exchange protocol for external crowdsourcing tools is based on RDF and PURIs, so the user knowledge can be captured in a well-structured form. For example, the geographic coordinates that the users of Ajapaik contribute by simply clicking on an interactive map can be returned to the NAE's Anna/AIS system in the form of standardised PURIs. These PURIs can later be used to pull together information related to a geographic spot from any online source that also uses standardised PURIs for geo-coordinates. In a similar manner, Ajapaik stores the PURIs it received from the archives at the beginning of the process and can provide its users with links to the related information in the NAE's online services.

4.5 Developing Functionality for Receiving Crowdsourced Information

User contributions are received by AIS via a simple SOAP service, so it is easy to build this communication capability into new crowdsourcing tools. Inside AIS, the newly imported data is instantly published next to the official metadata, but clearly labelled as "unofficial contributions from users."
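The trust-weighted aggregation Ajapaik's text describes for geotags might look roughly like the sketch below. The trust threshold, the minimum number of trusted tags and the trust scores themselves are illustrative assumptions; the actual Ajapaik algorithm is not published here.

```python
# Sketch of trust-weighted geotag aggregation: coordinates from
# trustworthy users are combined as a weighted average, and a photo
# counts as geotagged only when enough trusted tags cluster together.
# Thresholds and trust values are illustrative assumptions.

def aggregate_geotags(tags, min_trusted=3, trust_threshold=0.5):
    """tags: list of (lat, lon, trust) tuples with trust in [0, 1].
    Returns an aggregated (lat, lon) pair, or None if too few
    trustworthy users have tagged the photo."""
    trusted = [(lat, lon, t) for lat, lon, t in tags if t >= trust_threshold]
    if len(trusted) < min_trusted:
        return None
    total = sum(t for _, _, t in trusted)
    lat = sum(lat * t for lat, _, t in trusted) / total
    lon = sum(lon * t for _, lon, t in trusted) / total
    return round(lat, 6), round(lon, 6)

tags = [
    (58.38, 26.72, 0.9),   # three trusted users agree on Tartu
    (58.38, 26.73, 0.8),
    (58.39, 26.72, 0.7),
    (10.0, 10.0, 0.1),     # low-trust outlier is ignored
]
result = aggregate_geotags(tags)
```

Weighting by trust means a single inaccurate or malicious contribution cannot drag the aggregated position away from the consensus of users with good track records.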
The same data is also presented to the archivists as a list of proposals to be reviewed (see Figure 9).

Figure 9: List of proposals

The archivist can examine each proposal in detail (Figure 10) and then reject it, approve it or leave it as it is. Rejection deletes the proposal, while approval makes it part of the official archival description. Rejections and approvals can also be done in batches, to spare the archivist from excessive clicking.

Figure 10: Detailed view of a proposal

5 Combining Components in the Workflow

5.1 Curated Crowdsourcing

The general business process workflow for crowdsourcing at the NAE can be seen in Figure 11.

Figure 11: Business process workflow for crowdsourcing

The workflow covers the following activities:
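The review cycle described above (approval merges a value into the official description, rejection discards the proposal, and either decision can be applied in batches) can be sketched as follows. The data structures and field names are illustrative assumptions, not the actual AIS schema.

```python
# Sketch of the proposal review cycle: user contributions sit next to
# the official description as pending proposals. Approval merges the
# proposed value into the official metadata; rejection deletes the
# proposal. Structures and names are illustrative, not the AIS schema.

def review(record: dict, proposal_index: int, decision: str) -> None:
    """Apply a single decision: 'approve', 'reject' or 'leave'."""
    proposal = record["proposals"][proposal_index]
    if decision == "approve":
        record["official"][proposal["field"]] = proposal["value"]
    # Both approve and reject remove the proposal from the queue;
    # "leave" keeps it pending for a later decision.
    if decision in ("approve", "reject"):
        record["proposals"][proposal_index] = None

def review_batch(record: dict, decision: str) -> None:
    """Apply one decision to every pending proposal, sparing the
    archivist from clicking through them one by one."""
    for i, proposal in enumerate(record["proposals"]):
        if proposal is not None:
            review(record, i, decision)
    record["proposals"] = [p for p in record["proposals"] if p is not None]

record = {
    "official": {"title": "EÜE group photo"},
    "proposals": [
        {"field": "place", "value": "Tartu"},
        {"field": "year", "value": "1978"},
    ],
}
review_batch(record, "approve")
```

After the batch approval, both proposed values have become part of the official description and the proposal queue is empty.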
5.2 Spontaneous Crowdsourcing

The NAE has noticed that some users do not want to use crowdsourcing tools but are still willing to share their knowledge if it can be done quickly within their normal work environment. The NAE therefore developed an alternative crowdsourcing workflow that allows users to propose changes to nearly all elements of archival description.

Figure 12: The first few fields of the spontaneous crowdsourcing form

As mentioned above, a simple version of such functionality had been present in AIS version 1 and FOTIS in the form of a "report error" link. The user was provided just one plain text field to describe whatever suggestions they had. While easy to use, this solution was very laborious for the archivists, so there was demand for a more automated process. The new method allows structured proposals, i.e. new values can be suggested granularly for any individual data field present in AIS. As some elements of description cannot take input from anyone but archivists (e.g. reference codes), it makes sense to hide them from the proposal form. This is why the first two steps of the workflow in Figure 13 are necessary: they set the initial configuration of the system.

Figure 13: Business process workflow for spontaneous crowdsourcing

The workflow consists of the following steps:
6 Conclusions

Our research and practical experimentation confirmed that it is possible to complement the OAIS Data Management Functional Entity (DMFE) to support descriptive information updates more widely and effectively: we can now acquire the contributions of external organisations and individuals using only the necessary specific steps instead of the full complexity of the Ingest entity. At the same time, it became clear that there is synergy between 5-star open data and the transformation of user knowledge. Linked open data serves as a great communication interface between crowdsourcing tools and, through its inherent linkability, helps create attractive mashed-up crowdsourcing environments. In turn, the user-contributed knowledge and work hours are a significant force in filling the gaps in archival knowledge, and thus improve the quality of the linked open data. It is reasonable to involve external parties in creating and managing the crowdsourcing solutions. There are several advantages. One of them is that existing solutions may already have their own loyal user communities, which means that the archives do not need to spend additional resources on inviting people to the crowdsourcing environment. The described information system was not yet live at the time of printing, but the technical tests and experiments performed gave ample reason to predict good results. As the new solution allows collecting additional descriptive information for multiple dimensions from multiple sources, it gives the NAE the opportunity to receive descriptions of differing granularity and quality. The solution is in essence data-independent, so it is possible to build very complex relations between information entities without any restricting constraints declared by data types. Despite the numerous successful technical experiments, real-life pilot projects still have to be performed. Exposing the solution to practice will reveal whether users accept the tools we developed.
The following list summarises the main values of this effort for the archival community.
Further studies should look into developing standardised ontologies (to improve the interoperability of linked data) and standardising the means of communication between catalogue and crowdsourcing systems.

Notes
References
About the Authors