Protein Information Resource


The Protein Information Resource (PIR), located at Georgetown University Medical Center, is an integrated public bioinformatics resource that supports genomic and proteomic research and related scientific studies. It contains protein sequence databases. [1] [2] [3] [4] [5] [6] [7]

History

PIR was established in 1984 by the National Biomedical Research Foundation as a resource to assist researchers and customers in the identification and interpretation of protein sequence information. Prior to that, the foundation compiled the first comprehensive collection of macromolecular sequences in the Atlas of Protein Sequence and Structure, published from 1964 to 1974 under the editorship of Margaret Dayhoff. Dayhoff and her research group pioneered the development of computer methods for the comparison of protein sequences, for the detection of distantly related sequences and duplications within sequences, and for the inference of evolutionary histories from alignments of protein sequences. [8]

Winona Barker and Robert Ledley assumed leadership of the project after Dayhoff's death in 1983. In 1999, Cathy H. Wu joined the National Biomedical Research Foundation, and later Georgetown University Medical Center, to head the bioinformatics efforts of PIR; she has served first as Principal Investigator and, since 2001, as Director. [citation needed]

For four decades, PIR has made many protein databases and analysis tools freely accessible to the scientific community, including the Protein Sequence Database, the first international database (see PIR-International), which grew out of the Atlas of Protein Sequence and Structure. [citation needed]

In 2002, PIR, along with its international partners, the European Bioinformatics Institute and the Swiss Institute of Bioinformatics, was awarded a grant from the NIH to create UniProt, a single worldwide database of protein sequence and function, by unifying the PIR Protein Sequence Database (PIR-PSD), Swiss-Prot, and TrEMBL databases. As of 2010, PIR offers a wide variety of resources aimed mainly at supporting the propagation and standardization of protein annotation: PIRSF, [9] iProClass, and iProLINK.

The Protein Ontology is another popular database released by the Protein Information Resource. [10] [11]

Related Research Articles

In the field of bioinformatics, a sequence database is a type of biological database that is composed of a large collection of computerized ("digital") nucleic acid sequences, protein sequences, or other polymer sequences stored on a computer. The UniProt database is an example of a protein sequence database. As of 2013 it contained over 40 million sequences and is growing at an exponential rate. Historically, sequences were published in paper form, but as the number of sequences grew, this storage method became unsustainable.

UniProt: Database of protein sequences and functional information

UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature. It is maintained by the UniProt consortium, which consists of several European bioinformatics organisations and a foundation from Washington, DC, United States.

The European Bioinformatics Institute (EMBL-EBI) is an intergovernmental organization (IGO) which, as part of the European Molecular Biology Laboratory (EMBL) family, focuses on research and services in bioinformatics. It is located on the Wellcome Genome Campus in Hinxton near Cambridge, and employs over 600 full-time equivalent (FTE) staff. Institute leaders such as Rolf Apweiler, Alex Bateman, Ewan Birney, and Guy Cochrane, an adviser on the National Genomics Data Center Scientific Advisory Board, serve as part of the international research network of the BIG Data Center at the Beijing Institute of Genomics.

Pfam: Database of protein families

Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. The most recent version, Pfam 35.0, was released in November 2021 and contains 19,632 families.

Amos Bairoch

Amos Bairoch is a Swiss bioinformatician and Professor of Bioinformatics at the Department of Human Protein Sciences of the University of Geneva where he leads the CALIPHO group at the Swiss Institute of Bioinformatics (SIB) combining bioinformatics, curation, and experimental efforts to functionally characterize human proteins.

InterPro is a database of protein families, protein domains and functional sites in which identifiable features found in known proteins can be applied to new protein sequences in order to functionally characterise them.

PROSITE: Database of protein domains, families and functional sites

PROSITE is a protein database. It consists of entries describing the protein families, domains and functional sites as well as amino acid patterns and profiles in them. These are manually curated by a team of the Swiss Institute of Bioinformatics and tightly integrated into Swiss-Prot protein annotation. PROSITE was created in 1988 by Amos Bairoch, who directed the group for more than 20 years. Since July 2018, the director of PROSITE and Swiss-Prot is Alan Bridge.

Expasy is an online bioinformatics resource operated by the SIB Swiss Institute of Bioinformatics. It is an extensible and integrative portal which provides access to over 160 databases and software tools and supports a range of life science and clinical research areas, from genomics, proteomics and structural biology, to evolution and phylogeny, systems biology and medical chemistry. The individual resources are hosted in a decentralized way by different groups of the SIB Swiss Institute of Bioinformatics and partner institutions.

MicrobesOnline

MicrobesOnline is a publicly and freely accessible website that hosts multiple comparative genomic tools for comparing microbial species at the genomic, transcriptomic and functional levels. MicrobesOnline was developed by the Virtual Institute for Microbial Stress and Survival, which is based at the Lawrence Berkeley National Laboratory in Berkeley, California. The site was launched in 2005, with regular updates until 2011.

SUPERFAMILY is a database and search platform of structural and functional annotation for all proteins and genomes. It classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. Domains are functional, structural, and evolutionary units that form proteins. Domains of common ancestry are grouped into superfamilies. The domains and domain superfamilies are defined and described in SCOP. Superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

PDBsum is a database that provides an overview of the contents of each 3D macromolecular structure deposited in the Protein Data Bank. The original version of the database was developed around 1995 by Roman Laskowski and collaborators at University College London. As of 2014, PDBsum is maintained by Laskowski and collaborators in the laboratory of Janet Thornton at the European Bioinformatics Institute (EBI).

Rolf Apweiler

Rolf Apweiler is a director, with Ewan Birney, of the European Bioinformatics Institute (EBI), part of the European Molecular Biology Laboratory (EMBL).

Blast2GO

Blast2GO, first published in 2005, is a bioinformatics software tool for the automatic, high-throughput functional annotation of novel sequence data. It uses the BLAST algorithm to identify similar sequences and then transfers existing functional annotation from already characterised sequences to the novel one. The functional information is represented via the Gene Ontology (GO), a controlled vocabulary of functional attributes. The Gene Ontology, or GO, is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species.

dcGO is a comprehensive ontology database for protein domains. As an ontology resource, dcGO integrates Open Biomedical Ontologies from a variety of contexts, ranging from functional information like Gene Ontology to others on enzymes and pathways, from phenotype information across major model organisms to information about human diseases and drugs. As a protein domain resource, dcGO includes annotations to both the individual domains and supra-domains.

In bioinformatics, the PANTHER classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. PANTHER is part of the Gene Ontology Reference Genome Project designed to classify proteins and their genes for high-throughput analysis.

Cathy H. Wu is the Edward G. Jefferson Chair, professor, and director of the Center for Bioinformatics & Computational Biology (CBCB) at the University of Delaware. She is also the director of the Protein Information Resource (PIR) and the Northeast Bioinformatics Collaborative Steering Committee, and an adjunct professor at the Georgetown University Medical Center.

PomBase is a model organism database that provides online access to the fission yeast Schizosaccharomyces pombe genome sequence and annotated features, together with a wide range of manually curated functional gene-specific data. The PomBase website was redeveloped in 2016 to provide users with a more fully integrated, better-performing service.

Model organism databases (MODs) are biological databases, or knowledgebases, dedicated to the provision of in-depth biological data for intensively studied model organisms. MODs allow researchers to easily find background information on large sets of genes, plan experiments efficiently, combine their data with existing knowledge, and construct novel hypotheses. They allow users to analyse results and interpret datasets, and the data they generate are increasingly used to describe less well studied species. Where possible, MODs share common approaches to collect and represent biological information. For example, all MODs use the Gene Ontology (GO) to describe functions, processes and cellular locations of specific gene products. Projects also exist to enable software sharing for curation, visualization and querying between different MODs. Organismal diversity and varying user requirements, however, mean that MODs are often required to customize capture, display, and provision of data.

Biocuration is the field of life sciences dedicated to organizing biomedical data, information and knowledge into structured formats, such as spreadsheets, tables and knowledge graphs. The biocuration of biomedical knowledge is made possible by the cooperative work of biocurators, software developers and bioinformaticians and is at the base of the work of biological databases.

References

  1. Official website of PIR at Georgetown University. http://pir.georgetown.edu/ (archived 2014-03-12 at the Wayback Machine).
  2. Wu, Cathy; Nebert, Daniel W. (2004). "Update on genome completion and annotations: Protein Information Resource". Human Genomics. 1 (3): 229–33. doi:10.1186/1479-7364-1-3-229. PMC 3525084. PMID 15588483.
  3. Wu, C. H. (2003). "The Protein Information Resource". Nucleic Acids Research. 31 (1): 345–347. doi:10.1093/nar/gkg040. PMC 165487. PMID 12520019.
  4. Wu, CH; Huang, H; Arminski, L; Castro-Alvear, J; Chen, Y; Hu, ZZ; Ledley, RS; Lewis, KC; Mewes, HW; Orcutt, BC; Suzek, BE; Tsugita, A; Vinayaka, CR; Yeh, LS; Zhang, J; Barker, WC (2002-01-01). "The Protein Information Resource: an integrated public resource of functional annotation of proteins". Nucleic Acids Research. 30 (1): 35–37. doi:10.1093/nar/30.1.35. ISSN 1362-4962. PMC 99125. PMID 11752247.
  5. Barker, W. C.; Garavelli, J. S.; Hou, Z.; Huang, H.; Ledley, R. S.; McGarvey, P. B.; Mewes, H. W.; Orcutt, B. C.; Pfeiffer, F.; Tsugita, A.; Vinayaka, C. R.; Xiao, C.; Yeh, L. S.; Wu, C. (2001). "Protein Information Resource: A community resource for expert annotation of protein data". Nucleic Acids Research. 29 (1): 29–32. doi:10.1093/nar/29.1.29. PMC 29802. PMID 11125041.
  6. Barker, W. C. (2000). "The Protein Information Resource (PIR)". Nucleic Acids Research. 28 (1): 41–44. doi:10.1093/nar/28.1.41. PMC 102418. PMID 10592177.
  7. George, D. G.; Dodson, R. J.; Garavelli, J. S.; Haft, D. H.; Hunt, L. T.; Marzec, C. R.; Orcutt, B. C.; Sidman, K. E.; Srinivasarao, G. Y.; Yeh, L.-S. L.; Arminski, L. M.; Ledley, R. S.; Tsugita, A.; Barker, W. C. (1997). "The Protein Information Resource (PIR) and the PIR-International Protein Sequence Database". Nucleic Acids Research. 25 (1): 24–27. doi:10.1093/nar/25.1.24. PMC 146415. PMID 9016497.
  8. Izet, M (2016). "The Most Influential Scientists in the Development of Medical informatics (13): Margaret Belle Dayhoff". Acta Inform Med. 24 (4).
  9. Wu, C. H.; Nikolskaya, A.; Huang, H.; Yeh, L. S.; Natale, D. A.; Vinayaka, C. R.; Hu, Z. Z.; Mazumder, R.; Kumar, S.; Kourtesis, P.; Ledley, R. S.; Suzek, B. E.; Arminski, L.; Chen, Y.; Zhang, J.; Cardenas, J. L.; Chung, S.; Castro-Alvear, J.; Dinkov, G.; Barker, W. C. (2004). "PIRSF: Family classification system at the Protein Information Resource". Nucleic Acids Research. 32 (90001): 112D–114. doi:10.1093/nar/gkh097. PMC 308831. PMID 14681371.
  10. "GeorgeTown.edu - Protein Ontology". Archived from the original on 2011-03-10. Retrieved 2017-12-04.
  11. Chicco, Davide; Masseroli, Marco (2019). "Biological and Medical Ontologies: Protein Ontology (PRO)". Encyclopedia of Bioinformatics and Computational Biology. pp. 832–837. doi:10.1016/B978-0-12-809633-8.20396-8. ISBN 9780128114322. S2CID 66974875.