CiteSeerX

Type of site: Bibliographic database
Available in: Español
Owner: Pennsylvania State University College of Information Sciences and Technology
Revenue: Active
URL: citeseerx.ist.psu.edu
Registration: Optional
Launched: 2008 (as CiteSeerX); 1997 (as CiteSeer)
Current status: Active
Content license: Creative Commons BY-NC-SA license [1]

CiteSeerX (formerly called CiteSeer) is a public search engine and digital library for scientific and academic papers, primarily in the fields of computer and information science.

CiteSeer's goal is to improve the dissemination of and access to academic and scientific literature. As a non-profit service that can be freely used by anyone, it has been considered part of the open access movement, which is attempting to change academic and scientific publishing to allow greater access to scientific literature. CiteSeer freely provided Open Archives Initiative metadata for all indexed documents and, where possible, linked indexed documents to other sources of metadata such as DBLP and the ACM Portal. To promote open data, CiteSeerX shares its data for non-commercial purposes under a Creative Commons license. [1]

CiteSeer is considered a predecessor of academic search tools such as Google Scholar and Microsoft Academic Search. [2] CiteSeer-like engines and archives usually only harvest documents from publicly available websites and do not crawl publisher websites. For this reason, authors whose documents are freely available are more likely to be represented in the index.

CiteSeer changed its name to ResearchIndex at one point and then changed it back. [3]

History

CiteSeer and CiteSeer.IST

CiteSeer was created by researchers Lee Giles, Kurt Bollacker and Steve Lawrence in 1997 while they were at the NEC Research Institute (now NEC Labs), Princeton, New Jersey, US. CiteSeer's goal was to actively crawl and harvest academic and scientific documents on the web and use autonomous citation indexing to permit querying by citation or by document, ranking them by citation impact. At one point, it was called ResearchIndex.
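The idea of querying by citation and ranking by citation impact can be illustrated with a small sketch. This is not CiteSeer's actual algorithm: it assumes the reference lists have already been extracted and matched to documents, and it uses a plain count of citing documents as a stand-in for citation impact. The function names and the toy data are hypothetical.

```python
from collections import Counter


def citation_counts(references):
    """Count, for each document, how many distinct other documents cite it.

    `references` maps a document id to the list of ids it cites.
    """
    counts = Counter()
    for citing, cited_list in references.items():
        # A document citing the same work twice still counts once,
        # and self-citations are ignored.
        for cited in set(cited_list):
            if cited != citing:
                counts[cited] += 1
    return counts


def cited_by(references, target):
    """Query by citation: return the documents that cite `target`."""
    return sorted(
        doc for doc, cited in references.items()
        if target in cited and doc != target
    )


def rank_by_citations(references):
    """Rank all known documents by citation count, ties broken by id."""
    counts = citation_counts(references)
    docs = set(references) | set(counts)
    return sorted(docs, key=lambda d: (-counts[d], d))


# Toy citation graph (made-up ids): A cites B and C, and so on.
TOY_GRAPH = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": [],
    "D": ["C", "B", "B"],
}

if __name__ == "__main__":
    print(rank_by_citations(TOY_GRAPH))  # ['C', 'B', 'A', 'D']
    print(cited_by(TOY_GRAPH, "B"))      # ['A', 'D']
```

In a real index the hard part is upstream of this step: extracting reference strings from documents and deciding which of them refer to the same work.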

CiteSeer became public in 1998 and had many new features unavailable in academic search engines at that time.

CiteSeer was granted United States patent No. 6289342, titled "Autonomous citation indexing and literature browsing using citation context", on September 11, 2001. The patent was filed on May 20, 1998, and has a priority date of January 5, 1998. A continuation patent (US patent No. 6738780) was filed on May 16, 2001, and granted on May 18, 2004. [citation needed]

After leaving NEC, CiteSeer was hosted in 2004 as CiteSeer.IST on the World Wide Web at the College of Information Sciences and Technology, The Pennsylvania State University, and had over 700,000 documents. For enhanced access, performance and research, similar versions of CiteSeer were supported at universities such as the Massachusetts Institute of Technology, University of Zürich and the National University of Singapore. However, these versions of CiteSeer proved difficult to maintain and are no longer available. Because CiteSeer only indexes freely available papers on the web and does not have access to publisher metadata, it returns lower citation counts than sites, such as Google Scholar, that have publisher metadata.

CiteSeer had not been comprehensively updated since 2005 due to limitations in its architecture design. It had a representative sampling of research documents in computer and information science, but its coverage was restricted to papers that were publicly available, usually at an author's homepage, or submitted by an author. To overcome some of these limitations, a modular and open source architecture for CiteSeer was designed: CiteSeerX.

CiteSeerX

CiteSeerX replaced CiteSeer, and all queries to CiteSeer were redirected to it. CiteSeerX is a public search engine, digital library and repository for scientific and academic papers, with a primary focus on computer and information science. [4] It has since been expanding into other scholarly domains such as economics and physics. Released in 2008, it was loosely based on the previous CiteSeer search engine and digital library and is built with a new open source infrastructure, SeerSuite, and new algorithms and their implementations. It was developed by researchers Isaac Councill and C. Lee Giles at the College of Information Sciences and Technology, Pennsylvania State University. It continues to support the goals outlined by CiteSeer: to actively crawl and harvest academic and scientific documents on the public web, to permit querying by citation, and to rank documents by citation impact. Lee Giles, Prasenjit Mitra, Susan Gauch, Min-Yen Kan, Pradeep Teregowda, Juan Pablo Fernández Ramírez, Pucktada Treeratpituk, Jian Wu, Douglas Jordan, Steve Carman, Jack Carroll, Jim Jansen, and Shuyi Zheng are or have been actively involved in its development. A table search feature was later introduced. [5] It has been funded by the National Science Foundation, NASA, and Microsoft Research.

CiteSeerX continues to be rated as one of the world's top repositories, and was rated number 1 in July 2010. [6] It currently has over 6 million documents with nearly 6 million unique authors and 120 million citations. [timeframe?]

CiteSeerX also shares its software, data, databases and metadata with other researchers, currently via Amazon S3 and rsync. [7] Its new modular open source architecture and software (previously available on SourceForge, now on GitHub) are built on Apache Solr and other Apache and open source tools, which allows it to be a testbed for new algorithms in document harvesting, ranking, indexing, and information extraction.

CiteSeerX caches some PDF files that it has scanned. As such, each page includes a DMCA link which can be used to report copyright violations. [8]

Current features

Automated information extraction

CiteSeerX uses automated information extraction tools, usually built on machine learning methods such as ParsCit, to extract scholarly document metadata such as title, authors, abstract and citations. As a result, there are sometimes errors in authors and titles. Other academic search engines have similar errors.
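Why automated extraction produces such errors can be seen in a deliberately simple sketch. The code below is not ParsCit and does not use machine learning: it is a toy regular-expression heuristic, assumed here only for illustration, that handles one reference layout ("Authors (Year). Title. Venue.") and fails on any other. The pattern and the function name are hypothetical.

```python
import re

# Toy pattern for references shaped like "Authors (Year). Title. Venue."
# Real extractors learn such structure from labeled data instead.
REF_PATTERN = re.compile(
    r"^(?P<authors>.+?)\s*\((?P<year>\d{4})\)\.\s*(?P<title>[^.]+)\."
)


def parse_reference(text):
    """Return authors, year and title, or None if the layout is not recognized."""
    match = REF_PATTERN.match(text.strip())
    if match is None:
        return None
    # Authors are assumed to be separated by semicolons in this layout.
    authors = [a.strip() for a in match.group("authors").split(";") if a.strip()]
    return {
        "authors": authors,
        "year": int(match.group("year")),
        "title": match.group("title").strip(),
    }


# The same work written in two layouts; only the first fits the pattern.
SUPPORTED = (
    "Lawrence, Steve (2001). ResearchIndex: Inside the world's largest "
    "free full-text index of scientific literature. K-CAP 2001."
)
UNSUPPORTED = (
    "S. Lawrence. ResearchIndex: Inside the world's largest free "
    "full-text index of scientific literature. K-CAP, 2001."
)

if __name__ == "__main__":
    print(parse_reference(SUPPORTED))
    print(parse_reference(UNSUPPORTED))  # None: year is not in parentheses
```

Reference formats vary widely across venues and authors, which is one reason learned sequence-labeling models are preferred over fixed rules, and why some extracted records are still wrong.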

Focused crawling

CiteSeerX crawls publicly available scholarly documents primarily from author webpages and other open resources, and does not have access to publisher metadata. As a result, citation counts in CiteSeerX are usually lower than those in Google Scholar and Microsoft Academic Search, which have access to publisher metadata.

Usage

CiteSeerX has nearly one million users worldwide based on unique IP addresses and has millions of hits daily. Annual downloads of document PDFs were nearly 200 million for 2015.

Data

CiteSeerX data is regularly shared under a Creative Commons BY-NC-SA license with researchers worldwide and has been and is used in many experiments and competitions.

Thanks to its OAI-PMH endpoint, [9] CiteSeerX is an open archive and its content is indexed like that of an institutional repository by academic search engines and other consumers, for instance BASE and Unpaywall.
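How a harvester talks to an OAI-PMH endpoint can be sketched as follows. The example builds a ListRecords request URL and parses Dublin Core fields from a response using only the Python standard library. The base URL is a placeholder rather than CiteSeerX's actual endpoint, the sample response is made up, and the function names are hypothetical; the verb, the oai_dc metadata prefix and the XML namespaces come from the OAI-PMH and Dublin Core specifications.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Placeholder base URL (an assumption); use the repository's real endpoint.
BASE_URL = "https://example.org/oai2"

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL.

    A resumption token is an exclusive argument in OAI-PMH, so it
    replaces metadataPrefix when continuing a partial list.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Extract identifier, title and creators from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "creators": [c.text for c in rec.iterfind(".//dc:creator", NS)],
        })
    return records


# Made-up response with a single record, for offline illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Paper</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:creator>Roe, Richard</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url(BASE_URL))
    print(parse_records(SAMPLE_RESPONSE))
```

A full harvester would fetch each URL, read the resumptionToken element from the response, and repeat the request until no token is returned.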

Other SeerSuite-based search engines

The CiteSeer model had been extended to cover academic documents in business with SmealSearch and in e-business with eBizSearch. However, these were not maintained by their sponsors. An older version of both could once be found at BizSeer.IST but is no longer in service.

Other Seer-like search and repository systems have been built for chemistry (ChemXSeer) and for archaeology (ArchSeer). Another, BotSeer, was built for robots.txt file search. All of these are built on the open source tool SeerSuite, which uses the open source indexer Lucene.

References

  1. "CiteSeerX Data Policy". Archived from the original on 2012-01-05. Retrieved 2015-11-10.
  2. Kodakateri Pudhiyaveetil, Ajith; Gauch, Susan; Luong, Hiep; Eno, Josh (2009). "Conceptual recommender system for CiteSeerX". Proceedings of the third ACM conference on Recommender systems. New York, New York, US: ACM Press. p. 241. doi:10.1145/1639714.1639758. ISBN 978-1-60558-435-5. S2CID 13900679.
  3. Lawrence, Steve (2001). "ResearchIndex: Inside the world's largest free full-text index of scientific literature". Proceedings of the international conference on Knowledge capture - K-CAP 2001. p. 3. doi:10.1145/500737.500740. ISBN 1-58113-380-4. S2CID 19592721.
  4. "About CiteSeerX". Archived from the original on 2010-07-22. Retrieved 2010-05-07.
  5. "The CiteSeerX Team". Pennsylvania State University. Archived from the original on 2018-07-26. Retrieved 2018-05-01.
  6. "Ranking Web of World Repositories: Top 800 Repositories". Cybermetrics Lab. July 2010. Archived from the original on 2010-07-24. Retrieved 2010-07-24.
  7. "About CiteSeerX Data". Pennsylvania State University. Archived from the original on 2012-01-05. Retrieved 2012-01-25.
  8. For example, "CiteSeerx – DMCA Notice". CiteSeerX 10.1.1.604.4916. Archived from the original on 2022-03-18. The document with the identifier "10.1.1.604.4916" has been removed due to a DMCA takedown notice. If you believe the removal has been in error, please contact us through the feedback page, along with the identifier mentioned in this page.
  9. Hirst, Tony (2011-12-08). "Using OAI-PMH as a Single Record Level Query Interface to Citeseer". Archived from the original on 2020-11-24. Retrieved 2020-04-25.
