Open Archives Initiative Protocol for Metadata Harvesting

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol developed for harvesting metadata descriptions of records in an archive so that services can be built using metadata from many archives. An implementation of OAI-PMH must support representing metadata in Dublin Core, but may also support additional representations. [1] [2]

The protocol is usually just referred to as the OAI Protocol.

OAI-PMH uses XML over HTTP. Version 2.0 of the protocol was released in 2002; the specification document was last updated in 2015. The specification is published under a Creative Commons BY-SA license.
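Because responses are XML documents delivered over HTTP, a harvester can process them with any standard XML parser. The following is a minimal sketch, not a reference implementation: the sample response is hand-written for illustration (the `example.org` identifiers and the record content are assumptions, not output from a real repository), and the function name `parse_records` is hypothetical. The namespace URIs are those used by OAI-PMH 2.0 and unqualified Dublin Core.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH 2.0 responses and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample for illustration only; a real response would be
# fetched over HTTP from a repository's base URL.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-06-01T19:20:30Z</responseDate>
  <request verb="GetRecord" identifier="oai:example.org:1"
           metadataPrefix="oai_dc">http://example.org/oai</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2002-05-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example record</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
"""


def parse_records(xml_text):
    """Return one dict per <record>: identifier, datestamp, Dublin Core titles."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for record in root.iterfind(".//oai:record", NS):
        header = record.find("oai:header", NS)
        records.append({
            "identifier": header.findtext("oai:identifier", namespaces=NS),
            "datestamp": header.findtext("oai:datestamp", namespaces=NS),
            "titles": [t.text for t in record.iterfind(".//dc:title", NS)],
        })
    return records


if __name__ == "__main__":
    for rec in parse_records(SAMPLE_RESPONSE):
        print(rec["identifier"], rec["datestamp"], rec["titles"])
```

The sketch reads only the record header and the `dc:title` elements; a practical harvester would also need to handle error responses and deleted records, which are omitted here.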

History

In the late 1990s, Herbert Van de Sompel (Ghent University) was working with researchers and librarians at Los Alamos National Laboratory (US) and called a meeting to address interoperability difficulties among e-print servers and digital repositories. The meeting was held in Santa Fe, New Mexico, in October 1999. [3] A key development from the meeting was the definition of an interface that permitted e-print servers to expose metadata for the papers they held in a structured fashion, so that other repositories could identify and copy papers of interest. This interface/protocol was named the "Santa Fe Convention". [1] [2] [4]

Several workshops were held in 2000 at the ACM Digital Libraries conference, [5] at the 1st ACM/IEEE-CS joint conference on Digital libraries [6] [7] and elsewhere to share the ideas from the Santa Fe Convention. [8] The workshops revealed that the problems faced by the e-print community were shared by libraries, museums, journal publishers, and others who needed to share distributed resources. To address these needs, the Coalition for Networked Information [9] and the Digital Library Federation [10] provided funding to establish an Open Archives Initiative (OAI) secretariat managed by Herbert Van de Sompel and Carl Lagoze. The OAI held a meeting at Cornell University (Ithaca, New York) in September 2000 aimed at improving the interface developed at the Santa Fe Convention. [11] The specifications were then refined over e-mail.

OAI-PMH version 1.0 was introduced to the public in January 2001 at a workshop in Washington D.C., [12] and another in February in Berlin, Germany. [13] Subsequent modifications to the XML standard by the W3C required making minor modifications to OAI-PMH resulting in version 1.1. The current version, 2.0, was released in June 2002. It contained several technical changes and enhancements and is not backward compatible. [14]

Since 2001, CERN, later in collaboration with the University of Geneva, has organized bi-annual OAI workshops, [15] which over time have developed to cover most aspects of open science. Since 2021 the workshop series has been named the Geneva Workshop on Innovations in Scholarly Communication, with the nickname OAI reflecting its origin. [16]

Uses

Some commercial search engines use OAI-PMH to acquire more resources. Google initially included support for OAI-PMH when launching Sitemaps, but decided in May 2008 to support only the standard XML Sitemaps format. [17] In 2004, Yahoo! acquired content from OAIster (University of Michigan) that was obtained through metadata harvesting with OAI-PMH. Wikimedia uses an OAI-PMH repository to provide feeds of Wikipedia and related site updates for search engines and other bulk analysis/republishing endeavors. [18] When thousands of files are harvested every day, OAI-PMH can reduce network traffic and other resource usage through incremental harvesting. [19] NASA's Mercury metadata search system uses OAI-PMH to index thousands of metadata records from the Global Change Master Directory (GCMD) every day. [20]

The mod_oai project uses OAI-PMH to expose content that is accessible from Apache web servers to web crawlers.

OAI-PMH has since also been applied to the sharing of scientific data. [21]

Software

OAI-PMH is based on a client–server architecture, in which "harvesters" request information on updated records from "repositories". Requests for data can be based on a datestamp range, and can be restricted to named sets defined by the provider. Data providers are required to provide XML metadata in Dublin Core format, and may also provide it in other XML formats.
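The selective harvesting described above can be sketched as a small helper that builds request URLs. This is an illustrative sketch under stated assumptions: the base URL `http://example.org/oai`, the set name `physics`, and the function names are hypothetical. The request arguments themselves (`verb`, `metadataPrefix`, `from`, `until`, `set`, `resumptionToken`) and the `oai_dc` prefix for Dublin Core come from the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, until_date=None, set_spec=None):
    """Build a ListRecords request URL for a selective harvest.

    from_date / until_date bound the datestamp range; set_spec restricts
    the harvest to a named set defined by the repository.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def resume_url(base_url, token):
    """Build the follow-up request for a partial list.

    When a resumptionToken is sent, it is the only argument besides the verb.
    """
    return base_url + "?" + urlencode(
        {"verb": "ListRecords", "resumptionToken": token}
    )


if __name__ == "__main__":
    # Hypothetical repository and set name, for illustration only.
    print(list_records_url("http://example.org/oai",
                           from_date="2002-01-01",
                           until_date="2002-06-30",
                           set_spec="physics"))
```

A harvester that stores the date of its last successful run and passes it as `from_date` on the next run only receives records changed since then, which is the basis of incremental harvesting.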

A number of software systems support OAI-PMH, including Fedora, EThOS from the British Library, GNU EPrints from the University of Southampton, Open Journal Systems from the Public Knowledge Project, Desire2Learn, DSpace from MIT, HyperJournal from the University of Pisa, Digibib from Digibis, MyCoRe, Koha, Primo, DigiTool, Rosetta and MetaLib from Ex Libris, ArchivalWare from PTFS, DOOR [22] from the eLab [23] in Lugano, Switzerland, panFMP from the PANGAEA data library, [24] SimpleDL from Roaring Development, and jOAI from the National Center for Atmospheric Research. [25]

Archives

A number of large archives support the protocol, including arXiv and the CERN Document Server.

Related Research Articles

The Open Archives Initiative (OAI) was an informal organization, centered on the colleagues Herbert Van de Sompel, Carl Lagoze, Michael L. Nelson and Simeon Warner, that developed and applied technical interoperability standards for archives to share catalogue information (metadata). The group came together in the late 1990s and was active for around twenty years. OAI coordinated three specification activities in particular: OAI-PMH, OAI-ORE and ResourceSync. Throughout, the group worked towards building a "low-barrier interoperability framework" for archives containing digital content, allowing people to harvest metadata. Such sets of metadata have since been harvested to provide "value-added services", often by combining different data sets.

CiteSeerX is a public search engine and digital library for scientific and academic papers, primarily in the fields of computer and information science.

Z39.50 is an international standard client–server, application layer communications protocol for searching and retrieving information from a database over a TCP/IP computer network, developed and maintained by the Library of Congress. It is covered by ANSI/NISO standard Z39.50, and ISO standard 23950.

An institutional repository (IR) is an archive for collecting, preserving, and disseminating digital copies of the intellectual output of an institution, particularly a research institution. Academics also use their IRs to archive published works, increasing their visibility and collaboration with other academics. However, most of the outputs produced by universities are not effectively accessed and shared by researchers and other stakeholders. As a result, academics should be involved in the implementation and development of an IR project so that they can learn the benefits and purpose of building an IR.

Sitemaps is a protocol in XML format meant for a webmaster to inform search engines about URLs on a website that are available for web crawling. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs of the site. This allows search engines to crawl the site more efficiently and to find URLs that may be isolated from the rest of the site's content. The Sitemaps protocol is a URL inclusion protocol and complements robots.txt, a URL exclusion protocol.

Fedora is a digital asset management (DAM) content repository architecture upon which institutional repositories, digital archives, and digital library systems might be built. Fedora is the underlying architecture for a digital repository, and is not a complete management, indexing, discovery, and delivery application. It is a modular architecture built on the principle that interoperability and extensibility are best achieved by the integration of data, interfaces, and mechanisms as clearly defined modules.

mod_oai is an Apache module that allows web crawlers to efficiently discover new, modified, and deleted web resources from a web server by using OAI-PMH, a protocol which is widely used in the digital libraries community. mod_oai also allows harvesters to obtain "archive-ready" resources from a web server.

ScientificCommons was a project of the University of St. Gallen Institute for Media and Communications Management. The major aim of the project was to develop the world’s largest archive of scientific knowledge with fulltexts freely accessible to the public. The project was closed down in 2014.

BASE is a multi-disciplinary search engine to scholarly internet resources, created by Bielefeld University Library in Bielefeld, Germany. It is based on free and open-source software such as Apache Solr and VuFind. It harvests OAI metadata from institutional repositories and other academic digital libraries that implement the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), and then normalizes and indexes the data for searching. In addition to OAI metadata, the library indexes selected web sites and local data collections, all of which can be searched via a single search interface.

EPrints is a free and open-source software package for building open access repositories that are compliant with the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). It shares many of the features commonly seen in document management systems, but is primarily used for institutional repositories and scientific journals. EPrints has been developed at the University of Southampton School of Electronics and Computer Science and released under the GPL-3.0-or-later license.

The Redalyc project is a bibliographic database and a digital library of Open Access journals, supported by the Universidad Autónoma del Estado de México with the help of numerous other higher education institutions and information systems.

PREservation Metadata: Implementation Strategies (PREMIS) is the de facto digital preservation metadata standard.

The Open Archives Initiative Object Reuse and Exchange (OAI-ORE) defines standards for the description and exchange of aggregations of web resources. The OAI-ORE specification implements the ORE Model which introduces the resource map (ReM) that makes it possible to associate an identity with aggregations of resources and make assertions about their structure and semantics.

A resource map (ReM) is a concept of the ORE Model for associating an identity with compound digital objects and making assertions about their structure and semantics. Compound objects combine distributed resources, including multiple media types.

Herbert Van de Sompel is a Belgian librarian, computer scientist, and musician, most known for his role in the development of the Open Archives Initiative (OAI) and standards such as OpenURL, Object Reuse and Exchange, and the OAI Protocol for Metadata Harvesting.

Invenio is an open source software framework for large-scale digital repositories that provides the tools for management of digital assets in an institutional repository and research data management systems. The software is typically used for open access repositories for scholarly and/or published digital content and as a digital library.

The OpenSIGLE repository provides open access to the bibliographic records of the former SIGLE database. The creation of the OpenSIGLE archive was decided by some major European STI centres, members of the former European network EAGLE for the collection and dissemination of grey literature. OpenSIGLE was developed by the French INIST-CNRS, with assistance from the German FIZ Karlsruhe and the Dutch Grey Literature Network Service (GreyNet). OpenSIGLE is hosted on an INIST-CNRS server at Nancy. Part of the open access movement, OpenSIGLE is referenced by the international Directory of Open Access Repositories.

An open repository or open-access repository is a digital platform that holds research output and provides free, immediate and permanent access to research results for anyone to use, download and distribute. To facilitate open access, such repositories must be interoperable according to the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Search engines harvest the content of open access repositories, constructing a database of research that is available worldwide free of charge.

OPUS is an open-source software package under the GNU General Public License used for creating Open Access repositories that are compliant with the Open Archives Initiative Protocol for Metadata Harvesting. It provides tools for creating collections of digital resources, as well as for their storage and dissemination. It is usually used at universities, libraries and research institutes as a platform for institutional repositories.

MyCoRe is an open source repository software framework for building disciplinary or institutional repositories, digital archives, digital libraries, and scientific journals. The software is developed at various German university libraries and computer centers. Although most MyCoRe web applications are located in Germany, there are English-language applications, such as "The International Treasury of Islamic Manuscripts" at the University of Cambridge (UK).

References

  1. Lynch, Clifford A. (August 2001). "Metadata harvesting and the Open Archives Initiative". ARL: A Bimonthly Report (217). Archived from the original (PDF) on 25 May 2012.
  2. Marshall Breeding (September 2002). "Understanding the Protocol for Metadata Harvesting of the Open Archives Initiative". Computers in Libraries. 22 (8): 24–29. Retrieved 2021-02-08.
  3. Marshall, E. (1999). "Researchers plan free global preprint archive". Science. 286 (5441): 887a–887. doi:10.1126/science.286.5441.887a. PMID 10577235. S2CID 178990556.
  4. "The Santa Fe Convention by the Open Archives Initiative". Open Archives Initiative. February 15, 2000. Retrieved May 29, 2022.
  5. "The Santa Fe Convention of the Open Archives Initiative". dspace.library.uu.nl. Retrieved 2021-02-10.
  6. Edward A. Fox; Christine L. Borgman, eds. (2001). "Proceedings of the first ACM/IEEE-CS joint conference on Digital libraries". Joint Conference on Digital Libraries. Roanoke, Virginia, United States: ACM Press. doi:10.1145/379437. ISBN 978-1-58113-345-5.
  7. Lagoze, Carl; Van de Sompel, Herbert (2001). "The open archives initiative: building a low-barrier interoperability framework". Proceedings of the First ACM/IEEE-CS Joint Conference on Digital Libraries - JCDL '01. Roanoke, Virginia, United States: ACM Press: 54–62. CiteSeerX 10.1.1.161.6800. doi:10.1145/379437.379449. ISBN 978-1-58113-345-5. S2CID 1315824.
  8. Van de Sompel, Herbert; Lagoze, Carl (2000). "The Santa Fe Convention of the Open Archives Initiative". D-Lib Magazine. 6 (2). doi:10.1045/february2000-vandesompel-oai. ISSN 1082-9873.
  9. "Homepage". Coalition for Networked Information. Retrieved May 29, 2022.
  10. "Homepage". Digital Library Federation. Retrieved May 29, 2022.
  11. "OAi-tech Meeting, Cornell University, September 7-8 2000". www.openarchives.org. Retrieved 2021-02-10.
  12. "The Open Archives Initiative: Open Meeting Renaissance Hotel, Washington DC January 23, 2001". www.openarchives.org. Retrieved 2021-02-10.
  13. "The Open Archives Initiative: Open Meeting Staatsbibliothek zu Berlin, Germany February 26, 2001". www.openarchives.org. Retrieved 2021-02-10.
  14. Van de Sompel, Herbert; Young, Jeffrey A.; Hickey, Thomas B. (2003). "Using the OAI-PMH ... Differently". D-Lib Magazine. 9 (7/8). doi:10.1045/july2003-young. ISSN 1082-9873.
  15. "Previous OAI Workshops – OAI". The Geneva Workshop on Innovations in Scholarly Communication. Retrieved 2023-01-13.
  16. Azwa, Adnan Siti Norfateha. "Library Guide: Open Access Guide: The Latest on OA". umlibguides.um.edu.my. Retrieved 2023-01-13.
  17. "Retiring Support for OAI-PMH in Sitemaps". Google Search Central Blog. April 23, 2008. Retrieved May 29, 2022.
  18. "Wikimedia update feed service". Wikimedia Meta-Wiki. Retrieved 14 July 2013.
  19. "OAI Harvesting System". DLXS. Retrieved May 29, 2022.
  20. R. Devarakonda; G. Palanisamy; J. Green; B. Wilson (2010). "Data sharing and retrieval using OAI-PMH". Earth Science Informatics. Springer Berlin / Heidelberg. 4 (1): 1–5. doi:10.1007/s12145-010-0073-0. S2CID 46330319.
  21. Devarakonda, Ranjeet; Palanisamy, Giri; Green, James M.; Wilson, Bruce E. (2011). "Data sharing and retrieval using OAI-PMH". Earth Science Informatics. 4 (1): 1–5. doi:10.1007/s12145-010-0073-0. ISSN 1865-0473. S2CID 46330319.
  22. "Overview". DOOR. Retrieved May 29, 2022.
  23. "eLab". Universita della Svizzera italiana (in Italian). Retrieved May 29, 2022.
  24. "PANGAEA® Framework for Metadata Portals". panfmp.org.
  25. "NCAR/joai-project". Github.com. 31 May 2022.