D-Lib Magazine
December 2001

Volume 7 Number 12

ISSN 1082-9873

How Do Physicists Use an E-Print Archive?

Implications for Institutional E-Print Services

 

Stephen Pinfield
Library Services
University of Nottingham, UK
[email protected]

Abstract

It has been suggested that institutional e-print services will become an important way of achieving the wide availability of e-prints across a broad range of subject disciplines. However, as yet there are few exemplars of this sort of service. This paper describes how physicists make use of an established centralized subject-based e-prints service, arXiv (formerly known as the Los Alamos XXX service), and discusses the possible implications of this use for institutional multidisciplinary e-print archives. A number of key points are identified, including technical issues (such as file formats and user interface design), management issues (such as submission procedures and administrative staff support), economic issues (such as installation and support costs), quality issues (such as peer review and quality control criteria), policy issues (such as digital preservation and collection development standards), academic issues (such as scholarly communication cultures and publishing trends), and legal issues (such as copyright and intellectual property rights). These are discussed with reference to the project to set up a pilot institutional e-print service at the University of Nottingham, UK. This project is being used as a pragmatic way of investigating the issues surrounding institutional e-print services, particularly in seeing how flexible the e-prints model actually is and how easily it can adapt itself to disciplines other than physics.

Introduction

At the University of Nottingham, we are in the process of setting up an experimental institutional e-print archive [1]. The technical side of this has been reasonably straightforward. We are using the software produced by e-prints.org [2], and have customized the interface to give it a Nottingham 'look and feel'. However, the managerial and cultural aspects of 'self-archiving' have proved to be rather more complex. How can researchers be encouraged to contribute to the e-print archive? What is in it for them? How can procedures be managed to make it simple for them to archive their material?

To help us address these questions, we decided we needed to know more about how a successful e-print archive is actually being used by researchers. To do this, we chose to look at use of the arXiv service (formerly known as the Los Alamos XXX service) [3]. We consulted (face-to-face or by email) a number of researchers who are making use of arXiv. At the same time, we looked at the service itself to see if what we were being told seemed to reflect the wider experience.

We were interested in finding out more about how arXiv is used so that we could then consider whether any of its features would read across into other e-print services. To what extent is arXiv based on the unique working practices of physicists or is it an exemplar for other potential services? Would a centralized subject-based service such as arXiv have anything to tell us about how researchers might use an institutional multidisciplinary service? If institutional services are to be, as some have suggested, an important way of achieving the wide availability of e-prints across different subject disciplines [4], then it is crucial these questions are investigated.

Pre-print culture

The historical background of the Los Alamos service has been described in detail elsewhere [5]. What needs to be said here is that the archive was originally designed as a way of automating a paper-based process already in existence. This process was the circulation of 'pre-prints'. Pre-prints are pre-refereed, pre-publication papers which report on new research. They were circulated by physicists for two main reasons. Firstly, pre-prints were a good way of establishing priority. In a fast-moving field, it is important to get work into the public domain, with a researcher's name attached to it, as soon as research results are available. Publication in journals was usually too slow. Pre-prints, on the other hand, could be produced quickly and circulated immediately by mail to other research centers in order to establish priority. Once papers were circulated, the second reason for the existence of pre-prints came into play. Pre-prints were a way of soliciting comments on the research so that the paper could be refined for 'formal publication': a kind of informal peer review. On receipt of comments, a paper would then be redrafted for submission to a journal. For the discipline as a whole, pre-prints were a way of reducing the likelihood of unnecessary parallel research by allowing correlations to be quickly identified.

However, paper pre-prints were not entirely satisfactory. For a start, they did not halt all disputes over priority. Another key problem had to do with distribution. Distribution was inevitably limited. Only certain institutions received pre-prints; others (including most institutions in less developed countries) were effectively out of the loop. Researchers in these places were at a distinct disadvantage.

The Los Alamos e-print archive was designed to solve these problems. It clearly establishes priority by date stamping contributions. The e-print archive also widens access. Anyone with access to a networked computer can now look at the pre-print literature. Some have seen the e-print archive as 'democratizing' the scholarly communication process.

The Los Alamos archive was designed for High Energy Physics. Other areas of Physics, Mathematics, Non-linear Sciences and Computer Sciences have since come on board. But these are not the only disciplines to have pre-print traditions. For example, in the completely different fields of Management, Business and Finance, 'working papers' are circulated in a similar way to Physics pre-prints. It is interesting that although the RePEc (Research Papers in Economics) service partly covers these subjects [6], no real equivalent of arXiv has been set up. Why? This question requires further investigation, but part of the answer in this particular field may relate to the fact that in Management, Business and Finance, business schools have been keen to issue working papers in school series rather than relying on individual writers, and some institutions have actually developed a practice of charging for their working papers [7]. This may have held back resource sharing through self-archiving on the arXiv model.

The process of using arXiv

Bearing in mind the pre-print culture, the current pattern of usage of arXiv seems to look something like this:

  1. A researcher prepares a paper in one of a number of formats accepted on arXiv. The accepted formats are listed on the arXiv Help pages as: TeX/LaTeX/AMSTeX/AMSLaTeX; HTML plus PNG/GIF; PDF; PostScript; or Mathematica Notebook.
  2. The author self-archives the paper on arXiv. This can be done by email (following a pre-determined data structure so that the content can be machine-parsed), by FTP, or by using the submission procedure on the web.
  3. Other researchers are then able to read the paper. They may find out about the paper by using the arXiv web interface, or they may be informed of its existence by email if they have subscribed to the email alerting service.
  4. Having read it, other researchers may then comment on the paper by email.
  5. The author may then revise the paper in response to comments and replace the original paper on arXiv with the revised one. The paper may go through a number of iterations in this way.
  6. The revised paper may next be submitted to a journal as well as placed on arXiv. Some journal publishers now even allow submission in the form of an arXiv document number. The referees can go to the paper on arXiv in order to read it.
  7. Following referees' comments, the paper is either accepted or rejected by the journal.
  8. If rejected, the paper may be submitted to another journal after any necessary revisions. Revised versions may be included on arXiv.
  9. If accepted, revisions are normally made in response to referee comments and a final revised version of the paper submitted to the journal. It is common practice for this final revised version to also be placed on arXiv.

This account is inevitably based on anecdotal evidence but seems to be representative. Empirical evidence on the usage of the archive is emerging which complements and helps to flesh out this basic outline [8].

Discussion

There are a number of important issues here. Many of them lead to interesting questions about institutional archives and how they might be used. Some of the issues are discussed in what follows.

The format of papers (stage 1 above) is the first issue. The arXiv service will accept a variety of formats, including proprietary ones (PDFs), though not word-processed documents. This was fine for the physicists we talked to. Many of them do not use popular word processing software -- they were working on UNIX boxes not PCs or Macs. But this is not the case for all potential contributors to a multidisciplinary institutional e-print service. Many authors in other subject areas would have difficulty producing text in any of the formats specified by arXiv, even HTML. It is clear that institutional archive administrators have to give careful consideration to what formats they will accept from authors, what formats they will allow on the e-print server and whether they can, if necessary, convert from one format to another [9]. This is not, of course, a question just for the here and now. Long term preservation considerations are important as well. All archives need to address questions of the long-term viability of formats and how digital preservation measures can be put in place. In the short term, in order to attract authors from all disciplines, we believe at Nottingham we may have to do some 'hand holding' in this area, at least in the early stages of an institutional archive.
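
As an illustration only, the format decision discussed above could be expressed as a simple triage rule: accept a file as it stands, route it for mediated conversion, or refer it to the archive administrator. The sketch below is in Python; the file extensions, the list of convertible word-processor formats, and the function name `triage` are all assumptions made for the example, not arXiv's or Nottingham's actual rules.

```python
from pathlib import Path

# Assumed mapping of file extensions onto the format families listed on the
# arXiv Help pages. The extensions themselves are illustrative guesses.
ACCEPTED = {
    ".tex": "TeX/LaTeX",
    ".html": "HTML",
    ".htm": "HTML",
    ".pdf": "PDF",
    ".ps": "PostScript",
    ".nb": "Mathematica Notebook",
}

# Hypothetical word-processed formats that an institutional archive might
# offer to convert on the author's behalf ('hand holding').
CONVERTIBLE = {".doc", ".rtf", ".wpd"}


def triage(filename: str) -> str:
    """Return 'accept', 'convert' or 'refer' for a submitted file name."""
    ext = Path(filename).suffix.lower()
    if ext in ACCEPTED:
        return "accept"
    if ext in CONVERTIBLE:
        return "convert"
    # Anything else goes to a person rather than being rejected outright.
    return "refer"
```

A rule of this kind makes the policy explicit: every format placed in the 'convert' set implies a staff and equipment cost, and every format in the 'accept' set implies a long-term preservation commitment.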

The same might be said of the submission process (stage 2). The arXiv service relies on self-submission. All of the options for self-submission assume a basic level of IT literacy (a reasonable assumption perhaps for most physicists). However, once again administrators of a multidisciplinary institutional archive may not be able to make such an assumption about their users. In any institution there is an enormous range of IT literacy both between and within departments. The self-submission process on the e-prints.org software involves five or six major stages. Some potential contributors may not have the know-how or the patience to submit their documents themselves, especially as they would only be doing it irregularly. It has been suggested that, in the initial stages of the implementation of an institutional archive, it may be best to smooth the path of submission by having the archive administrator deposit papers on behalf of users. This would also allow institutional archive administrators to enhance user-created metadata. Text conversion and mediated submission may then be necessary but of course come at a price (staff, administrative and equipment costs). It has been argued that institutions are in a good position to subsidize a new service during start-up and provide the organizational framework for its continuation. At Nottingham, Library Services is absorbing the e-print facility into its remit.

Users of arXiv can find out about papers (stage 3) in two ways: first by browsing or searching the web interface, secondly through the email alerting service. The web interface of arXiv is famously unspectacular. Apart from the one "gratuitous icon", the interface makes no concessions to aesthetics. Compare it with a commercial provider's site, such as ScienceDirect [10], or even non-commercial sites like D-Lib Magazine, and the difference is striking. It will be interesting to see what institutional archives look like as they begin to appear. Institutional design policies (which often promote the use of graphics and JavaScript) are likely to have an influence on institutional e-print server presentation and lead to a departure from the arXiv ethos. Many institutional designers will also be concerned with attracting authors and users to their services and giving the site a feeling of authority in the design itself. But this is not just a matter of aesthetics. The low-tech nature of arXiv means that it has a fast download time, especially using a modem. This is something all designers should consider. Then there is the issue of functionality. The arXiv service includes most of the functionality that users might expect but it is rather unforgiving, especially for beginners. And yet it is particularly beginners that institutional e-print managers will be trying to attract. Institutional designers will perhaps be more concerned with helping all users (including beginners) easily get the most out of the service.

The second way of finding out about papers, the email alerting service, was more important to the researchers we spoke to than we had expected. Researchers can sign up to receive email notification of new additions to sections of the archive. It seems that many would prefer to use this alerting service, since it pushes relevant information to them rather than requiring them to search for it. It is difficult to know how this sort of service might translate to a situation where there are numerous multidisciplinary servers. It would be inconvenient to have to register with large numbers of servers for email alerts. There has been much discussion in the literature on how the metadata from distributed OAI (Open Archives Initiative) registered archives may be searched across the board [11]; perhaps more ought to be said about how alerts could be pushed to users.
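
One conceivable route, offered here only as a sketch, is for a single alerting service to poll many OAI-registered archives on the user's behalf, asking each for records added since the last alert. The OAI harvesting protocol defines a `ListRecords` request that takes `metadataPrefix` and `from` arguments; the Python below merely builds such request URLs. The archive base URLs and the function names are hypothetical, and no claim is made that any existing service works this way.

```python
from urllib.parse import urlencode

# Hypothetical base URLs standing in for OAI-registered institutional archives.
ARCHIVES = [
    "http://eprints.example.ac.uk/perl/oai",
    "http://archive.example.org/oai",
]


def list_records_url(base_url: str, since: str, prefix: str = "oai_dc") -> str:
    """Build a ListRecords request for records added or changed since a date.

    `since` is a date string in YYYY-MM-DD form; `oai_dc` is the unqualified
    Dublin Core metadata format that OAI repositories are expected to support.
    """
    query = urlencode(
        {"verb": "ListRecords", "metadataPrefix": prefix, "from": since}
    )
    return f"{base_url}?{query}"


def alert_requests(archives, since):
    """One request URL per archive; a central service would fetch each of
    these and email the merged results to its subscribers."""
    return [list_records_url(base, since) for base in archives]
```

Under this arrangement a researcher would register once with the alerting service rather than separately with every institutional server.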

Authors valued the informal peer review process associated with arXiv (stage 4). Often comments referred to other papers not cited by the author and this could be helpful. The arXiv system has the advantage of subjecting the paper to widespread scrutiny. Other disciplines, however, may do this more informally. Papers may be distributed to colleagues by email for comment. The full-scale pre-print culture may not be something that could easily translate to other disciplines. Some may not want to expose their pre-refereed work in this way. It is common, for example, for faculty to express concern about quality control on e-print servers. They may even see the distribution of pre-prints (or the self-archiving of pre-refereed e-prints) as vanity publishing. Many would prefer only post-refereed material be available. Some professionals in the medical community have gone so far as to say that pre-refereed material may be dangerous in their field if it is used as a basis for clinical practice. However, e-print archives do not have to include pre-prints, at least not for all disciplines. In Physics, pre-prints are an accepted part of scholarly communication; in other disciplines they are not. An e-print archive should include material that is most useful for its users.

There are important questions for multidisciplinary services here. Should a multidisciplinary institutional e-print service accept pre-prints? If so, from all disciplines or just some? Should any attempt be made to monitor quality? After all, an institution has an interest in ensuring that only high quality research output is available from its 'branded' archive. If quality is monitored, what criteria are to be used? Will users see this as unnecessary 'control'? These questions should be addressed in an archive collection development policy. At Nottingham, we feel it is too early to try devising a rigid collection development policy (which should also include file format standards and digital preservation issues mentioned earlier). Rather, we are using the development of a pilot service as a way of working through the issues. At present, we are looking at various forms of research output: pre-prints, book chapters, conference papers, university theses, as well as refereed journal articles. On the specific problem of the need to accept different material types for different subjects, one possible solution might be to set up several archive installations which could then have different collection policies (although they would, of course, remain cross-searchable). This is the approach Caltech seems to have adopted [12].

Perhaps the most important single issue associated with the current debate on scholarly communication is the relationship between journal publishers and e-print archives. The arXiv service has some interesting features to consider in this area. It is clear that many Physics publishers seem to have accommodated arXiv in their processes. The fact that some allow submission of an arXiv document number (see stage 6) demonstrates this acceptance. This shows a high degree of realism on their part and a willingness to work with the subject community. This will no doubt take time to develop in other subject areas. Institutional archives will not be able to offer such value-added features in the short term.

The attitude of publishers is particularly interesting in relation to the final post-refereed version of a paper (stage 9). It seems that it is common practice to deposit a copy of the final version on arXiv even when it is published in a journal. What about copyright? In some cases, publishers have adapted their copyright agreements to allow self-archiving. For example, the Institute of Physics requires authors to sign over copyright but grants a "personal licence" to authors "to post and update the Work on non-Publisher servers (including e-print servers) as long as access to such servers is not for commercial use and does not depend on payment for access, subscription, or membership fees" [13]. This has "effect from the date when the Work is accepted for publication". Other publishers have similar regulations [14]. Some make a distinction in their copyright regulations between the content itself and the publisher-produced files. Typically, authors are permitted to submit the final version of their paper to arXiv but not the PDF (or other format) produced for the journal. Simeon Warner of arXiv tells me that they reject the publisher-produced PDFs in order to avoid problems. There is evidence though that authors may not always adhere to the precise terms of 'liberal' copyright agreements. For example, the IoP copyright agreement requires authors to give a full citation to the journal article (which normally is done) but also to reproduce certain clauses of the agreement with their paper and to link to the IoP web site (which is not).

Not all publishers are so accommodating. Some copyright agreements (often from large commercial publishers) do not allow self-archiving of the post-refereed version of the paper. But there is evidence to suggest that attempts to restrict publication of the final version of the paper to the journal are being undermined by authors. Many authors deposit a copy of the final version in arXiv anyway. We identified a number of journals in the field with copyright agreements that do not permit final versions of papers to be made available in e-print servers, or in any other form outside of the journal. We then took a random sample of 10 papers in arXiv each of which gave a 'published in' reference to one of these journals and compared the e-print on arXiv with the corresponding article in the journal. In every case, we could not identify any differences between the two versions, except for layout. It seems that many authors are either negotiating exceptions to the copyright restrictions or are simply ignoring them and posting final versions of articles on arXiv regardless.
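
A comparison of this kind could in principle be partly automated. The following Python sketch, which is not the procedure used for the sample above, collapses whitespace so that layout differences are ignored and then scores how closely two texts match using the standard-library `difflib` module. The function names are assumptions made for the example, and extracting plain text from the e-print and journal files is taken as already done.

```python
import difflib
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace so that line breaks and spacing
    (layout) do not count as differences."""
    return re.sub(r"\s+", " ", text).strip()


def similarity(eprint_text: str, journal_text: str) -> float:
    """Return a ratio between 0.0 and 1.0; 1.0 means the two texts are
    identical once layout has been ignored."""
    matcher = difflib.SequenceMatcher(
        None, normalize(eprint_text), normalize(journal_text), autojunk=False
    )
    return matcher.ratio()
```

A score of 1.0 would correspond to the finding reported here, that no differences other than layout could be identified; any lower score would simply flag a pair for closer reading by a person.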

It is interesting to consider what the policies of institutional archives might be on this issue. It seems unlikely that an institution would want to encourage authors to ignore copyright agreements. Alternatively, they might encourage their members to alter agreements to claim the right to deposit content on e-print servers or to invoke the 'Harnad-Oppenheim strategy' (where the pre-print plus corrigenda are archived) [15]. The latter does not yet seem to have been widely used and will always be second best, but in the short term the Harnad-Oppenheim strategy may be an effective way of getting content into archives. Whether it is acceptable to researchers as a way of presenting their results in the longer term remains to be seen. An alternative institutional strategy might be to claim some sort of copyright on the publications on behalf of the university. The institution could then archive research output as a matter of policy. Such an approach may, however, be difficult to steer through institutional policy-making bodies (even if a case could be made for it in law) and is certainly not a necessary precursor to self-archiving.

One thing is clear. Researchers who contribute to arXiv now consider it to have a central place in their work. They use it to disseminate both pre-prints and post-refereed articles. Interestingly, they still wish to have their work accepted by journals, and endorsed by the formal peer review process, but do not see journals as the only means of distributing their work. In other words, self-archiving is not seen as a substitute for publishing in peer-reviewed journals, but rather as a useful supplement to the journal publishing process that makes research output widely available. Whilst this way of working has not permeated into many other fields, it is easy to see its potential, even bearing in mind different communication conventions between disciplines. In view of arXiv's success, there is clearly a need to discuss the issues more widely in institutions across different academic fields. The potential benefits have been discussed at length elsewhere [16], but until recently this discussion has been confined to a select band of people. It is important that archive developers not concentrate only on practical implementation issues (some of which have been touched on here) but also retain a clear view of the big picture and of the need to consult researchers. The big picture is the aim of freeing up research output and thereby improving research communication. This is certainly worth pursuing.

There is also potential for the development of historical archives. Whilst we often think of the raison d'être of e-print archives being the dissemination of the latest research results, they could also become major historical archives. Interestingly, there is evidence that arXiv itself is already being used in this way. Many authors have deposited papers published before the Los Alamos service was set up as a way of improving the availability of their earlier output. It is not unusual to find papers from the 1970s or 1980s on arXiv.

The Nottingham Implementation

Studying arXiv usage has helped us at Nottingham focus on these important issues. The Nottingham e-print archive pilot project has two strands, both of which have been informed by our investigation. The first strand is service implementation; the second strand is advocacy. The service is currently being implemented using version 1.1.1 of the OAI-compliant eprints.org software [17]. The pilot Nottingham e-print archive has been set up and tailored to give it a local look and feel. For demonstration purposes, the archive is currently being populated with existing material by Nottingham researchers.

At the same time, we are trying to raise the profile of the issues associated with e-prints in the institution. Researchers in most disciplines still need persuading that e-print archives may be the right way forward in their field. This persuasion may take various forms. For example, it may take the form of explanation. One of the physicists we spoke to said that if people from other disciplines really understood the potential of archiving their e-prints, they would start doing it. This optimistic view may apply to some researchers but not all. Others may be encouraged by institutional policies. It is important to win the support of senior members of the University who can encourage participation. In addition, it may be possible to attract users by offering them 'value added services'. These might include reports of hit counts on papers, citation analyses [18], and the ability to generate lists of publications for personal and departmental web sites, as well as the assistance with formatting and depositing already suggested. We are currently considering all of these at Nottingham as ways of helping us address the most important issue of getting a critical mass of content in place.

In many respects the pilot project is a pragmatic way of helping us work through some of the key issues identified in this paper. Identifying these issues in the first place has itself been important. We have had a number of key issues highlighted for us, including technical issues (such as file formats and user interface design), management issues (such as submission procedures and administrative staff support), economic issues (such as installation and support costs), quality issues (such as peer review and quality control criteria), policy issues (such as digital preservation and collection development standards), academic issues (such as scholarly communication cultures and publishing trends), and legal issues (such as copyright and intellectual property rights). All have been flagged as being important for both subject-specific and multidisciplinary institutional e-print services. Combining the implementation of a pilot server with the promotion of wide discussion of the issues is a good way of testing how flexible the e-prints model actually is and how easily it can adapt itself to other disciplines.

Conclusion

Looking at how arXiv is used on the ground was very useful for us. It has helped us identify some of the things that are most important to researchers and to consider how these might be of practical significance in running a multidisciplinary institutional service. There are, of course, differences in how a subject-based service and an institutional service may work, just as there are differences in the way different disciplines themselves work. The arXiv service was designed by Physicists for Physicists, but at least some of the principles upon which it is based have the potential to be transferable. Many of the practical managerial considerations of running an e-print service should be the same across e-print archives for different disciplines. However, the arXiv service has been ten years in the making. For this reason, perhaps we should not expect instant results from our new services. Rather, we need to continue working through the issues discussed in this article and see how other services might develop in practice. Many people in institutions (researchers, librarians and managers) are beginning to see the potential of e-print archives and to think about the issues involved. Perhaps the best thing to do now is to give it a try.

Acknowledgements

Thanks to Dr John Barrett, Prof. Stevan Harnad, Simeon Warner, and John MacColl for comments on drafts of this paper. Particular thanks are due to Dr Mike Gardner for his assistance in researching this article and in commenting on drafts.

Notes

[1] Nottingham e-Prints at <http://www-db.library.nottingham.ac.uk/eprints>.

[2] eprints.org at <http://www.eprints.org>.

[3] arXiv.org e-Print archive at <http://www.arxiv.org>. The service has recently moved to Cornell.

[4] See Stevan Harnad, For Whom the Gate Tolls? How and Why to Free the Refereed Research Literature Online Through Author/Institution Self-Archiving, Now, Section 7. Available at <http://www.cogsci.soton.ac.uk/~harnad/Tp/resolution.htm#7>; Stevan Harnad, 'The self-archiving initiative', Nature: webdebates. Available at <http://www.nature.com/nature/debates/e-access/Articles/harnad.html>.

[5] For example Richard E. Luce, 'E-prints Intersect the Digital Library: Inside the Los Alamos arXiv', Issues in Science & Technology Librarianship, Vol. 29, Winter 2001. Available at <http://www.library.ucsb.edu/istl/01-winter/article3.html>. See also Irma S. Holtkamp and Donna A. Berg (eds.), 'The Impact of Paul Ginsparg's ePrint arXiv (Formerly Known as xxx.lanl.gov) at Los Alamos National Laboratory on Scholarly Communications and Publishing: A Selected Bibliography'. Available at <http://lib-www.lanl.gov/libinfo/preprintsbib.htm>.

[6] RePEc <http://repec.org>.

[7] For an interesting discussion on differences between disciplines in relation to their adoption of electronic communication see Bob Kling and Geoffrey McKim, 'Not Just a Matter of Time: Field Differences and the Shaping of Electronic Media in Supporting Scientific Communication', Journal of the American Society for Information Science, Vol. 51, No. 14, 2000, pp. 1306-1320. E-print available at <http://arxiv.org/abs/cs.CY/9909008>.

[8] See for example the preliminary results of the eprints.org survey available at <http://www.eprints.org/results>. See also Tim Brody, Mining the social life of an eprint archive. Available at <http://opcit.eprints.org/tdb198/opcit>. Ian Hickman, Mining the social life of an eprint archive. Available at <http://opcit.eprints.org/ijh198>.

[9] See Keith Glavash, Bill Comstock and Larry Stone, 'Carrots and sticks: getting students to submit electronic theses at MIT'. Paper presented at Third International Symposium on Electronic Theses and Dissertations, 16-18 March, 2000. Available at <http://mit.edu/kglavash/www/E-Theses.carrots&sticks.pdf>. Many of the same formatting questions apply for e-prints as e-theses.

[10] ScienceDirect <http://www.sciencedirect.com>.

[11] See for example, Xiaoming Liu et al, 'Arc - An OAI Service Provider for Digital Library Federation', D-Lib Magazine, Vol 7, No. 4, April 2001. Available at <http://www.dlib.org/dlib/april01/liu/04liu.html>. For the Open Archives Initiative, see <http://www.openarchives.org>.

[12] Caltech Library Digital Collections <http://library.caltech.edu/digital>.

[13] Institute of Physics, Journals: Notes for Authors <http://www.iop.org/Journals/nfa/aocform.html>.

[14] Other examples include the American Institute of Physics, see for example Journal of Mathematical Physics <http://ojps.aip.org/jmp/jmpcr.jsp>; the American Psychological Association, see <http://www.apa.org/journals/posting.html>; Emerald, see <http://www.emeraldinsight.com/charter/index.htm>; and Wiley, see for example the Copyright Transfer Agreement of the Journal of the American Society for Information Science at <http://www.asis.org/Publications/JASIS/cta.html>.

[15] See Stevan Harnad, For Whom the Gate Tolls? How and Why to Free the Refereed Research Literature Online Through Author/Institution Self-Archiving, Now, Section 6. Available at <http://www.cogsci.soton.ac.uk/~harnad/Tp/resolution.htm#Harnad/Oppenheim>.

[16] See for example Stevan Harnad, 'The self-archiving initiative' Nature: webdebates. Available at <http://www.nature.com/nature/debates/e-access/Articles/harnad.html>. And Jane Garner, Lynne Horwood and Shirley Sullivan, 'The place of eprints in scholarly information delivery', Online Information Review, Vol. 25, No. 4, 2001, pp. 250-256.

[17] The eprints.org software has been reviewed by implementers at Caltech, see Ed Sponsler and Eric F. Van de Velde, 'Eprints.org Software: a Review', July 2001. E-print available at <http://resolver.library.caltech.edu/caltechLIB:2001.004>. This review has also been published on the SPARC web site, see <http://www.arl.org/sparc/core/index.asp?page=g20#6>.

[18] There is considerable potential here in tools such as Cite-Base developed at the University of Southampton, see <http://cite-base.ecs.soton.ac.uk/help/index.php3>.

Copyright 2001 Stephen Pinfield

DOI: 10.1045/december2001-pinfield