Genome mining

Last updated December 04, 2023

Genome mining describes the exploitation of genomic information for the discovery of biosynthetic pathways of natural products and their possible interactions.^[1] It depends on computational technology and bioinformatics tools. The mining process relies on a huge amount of data (represented by DNA sequences and annotations) accessible in genomic databases. By applying data mining algorithms, the data can be used to generate new knowledge in several areas of medicinal chemistry,^[2]^[3] such as discovering novel natural products.^[4]

History

In the mid- to late 1980s, researchers have increasingly focused on genetic studies with the advancing sequencing technologies.^[5] The GenBank database was established in 1982 for the collection, management, storage, and distribution of DNA sequence data due to the increasing availability of DNA sequences. With the increasing number of genetic data, biotechnological companies have been able to use human DNA sequence to develop protein and antibody drugs through genome mining since 1992.^[6] In the late 1990s, many companies, such as Amgen, Immunec, Genentech were able to develop drugs that progressed to the clinical stage by adopting genome mining.^[7] Since the Human Genome Project was completed in the early 2000, researchers have been sequencing the genomes of many microorganisms.^[8] Subsequently, many of these genomes have been carefully studied to identify new genes and biosynthetic pathways.^[9]

Algorithms

As large quantities of genomic sequence data began to accumulate in public databases, genetic algorithms became important to decipher the enormous collection of genomic data. They are commonly used to generate high-quality solutions to optimization and search problems by relying on bio-inspired operators such as mutation, crossover and selection.^[10] The followings are commonly used genetic algorithms:

AntiSMASH (Antibiotics and Secondary Metabolite Analysis Shell)^[11] addresses secondary metabolite genome pipelines.^[12]
PRISM (Prediction Informatics for Secondary Metabolites)^[13] is a combinatorial approach to chemical structure prediction for genetically encoded nonribosomal peptides and type I and II polyketides.^[14]
SIM (Statistically based sequence similarity) method, such as FASTA or PSI-BLAST, infer orthologous homology.^[15]
BLAST (Basic local alignment search tool) is an approach for rapid sequence comparison.^[16]

Applications

Genome mining applies on the discovery of natural product by facilitating the characterization of novel molecules and biosynthetic pathways.^[4]^[17]

Natural product discovery

The production of natural products is regulated by the biosynthetic gene clusters (BGCs) encoded in the microorganism.^[18] By adopting genome mining, the BGCs that produce the target natural product can be predicted.^[19] Some important enzymes responsible for the formation of natural products are polyketide synthases (PKS), non-ribosomal peptide synthases (NRPS), ribosomally and post-translationally modified peptides (RiPPs), and terpenoids, and many more.^[20] Mining for enzymes, researchers can figure out the classes that BGCs encode and compare target gene clusters to known gene clusters.^[21] To verify the relation between the BGCs and natural products, the target BGCs can be expressed by suitable host through the use of molecular cloning.^[22]

Databases and tools

Genetic data has been accumulated in databases. Researchers are able to utilize algorithms to decipher the data accessible from databases for the discovery of new processes, targets, and products.^[10] The following are databases and tools:

GenBank database provides genomic datasets for analysis.^[23]
UCSC Genome Browser
AntiSMASH-DB^[11]^[24] allows comparing the sequences of newly sequenced BGCs against those of previously predicted and experimentally characterized ones.^[25]
BIG-FAM ^[26] is a biosynthetic gene cluster family database.^[27]
DoBISCUIT^[28] is a database of secondary metabolite biosynthetic gene clusters.^[29]
MIBiG (Minimum Information about a Biosynthetic Gene cluster specification)^[30] provides a standard for annotations and metadata on biosynthetic gene clusters and their molecular products.^[31]
Interactive tree of life (iTOL)^[32] is a web-based tool for the display, manipulation and annotation of phylogenetic trees.^[33]

Related Research Articles

<span class="mw-page-title-main">UniProt</span> Database of protein sequences and functional information

UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature. It is maintained by the UniProt consortium, which consists of several European bioinformatics organisations and a foundation from Washington, DC, United States.

Ensembl genome database project is a scientific project at the European Bioinformatics Institute, which provides a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of our own species and other vertebrates and model organisms. Ensembl is one of several well known genome browsers for the retrieval of genomic information.

The European Bioinformatics Institute (EMBL-EBI) is an intergovernmental organization (IGO) which, as part of the European Molecular Biology Laboratory (EMBL) family, focuses on research and services in bioinformatics. It is located on the Wellcome Genome Campus in Hinxton near Cambridge, and employs over 600 full-time equivalent (FTE) staff. Institute leaders such as Rolf Apweiler, Alex Bateman, Ewan Birney, and Guy Cochrane, an adviser on the National Genomics Data Center Scientific Advisory Board, serve as part of the international research network of the BIG Data Center at the Beijing Institute of Genomics.

Computational genomics refers to the use of computational and statistical analysis to decipher biology from genome sequences and related data, including both DNA and RNA sequence as well as other "post-genomic" data. These, in combination with computational and statistical approaches to understanding the function of the genes and statistical association analysis, this field is also often referred to as Computational and Statistical Genetics/genomics. As such, computational genomics may be regarded as a subset of bioinformatics and computational biology, but with a focus on using whole genomes to understand the principles of how the DNA of a species controls its biology at the molecular level and beyond. With the current abundance of massive biological datasets, computational studies have become one of the most important means to biological discovery.

KEGG is a collection of databases dealing with genomes, biological pathways, diseases, drugs, and chemical substances. KEGG is utilized for bioinformatics research and education, including data analysis in genomics, metagenomics, metabolomics and other omics studies, modeling and simulation in systems biology, and translational research in drug development.

The Integrated Microbial Genomes system is a genome browsing and annotation platform developed by the U.S. Department of Energy (DOE)-Joint Genome Institute. IMG contains all the draft and complete microbial genomes sequenced by the DOE-JGI integrated with other publicly available genomes. IMG provides users a set of tools for comparative analysis of microbial genomes along three dimensions: genes, genomes and functions. Users can select and transfer them in the comparative analysis carts based upon a variety of criteria. IMG also includes a genome annotation pipeline that integrates information from several tools, including KEGG, Pfam, InterPro, and the Gene Ontology, among others. Users can also type or upload their own gene annotations and the IMG system will allow them to generate Genbank or EMBL format files containing these annotations.

MicrobesOnline is a publicly and freely accessible website that hosts multiple comparative genomic tools for comparing microbial species at the genomic, transcriptomic and functional levels. MicrobesOnline was developed by the Virtual Institute for Microbial Stress and Survival, which is based at the Lawrence Berkeley National Laboratory in Berkeley, California. The site was launched in 2005, with regular updates until 2011.

SUPERFAMILY is a database and search platform of structural and functional annotation for all proteins and genomes. It classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. Domains are functional, structural, and evolutionary units that form proteins. Domains of common Ancestry are grouped into superfamilies. The domains and domain superfamilies are defined and described in SCOP. Superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

In bioinformatics, miRBase is a biological database that acts as an archive of microRNA sequences and annotations. As of September 2010 it contained information about 15,172 microRNAs. This number has risen to 38,589 by March 2018. The miRBase registry provides a centralised system for assigning new names to microRNA genes.

Protein function prediction methods are techniques that bioinformatics researchers use to assign biological or biochemical roles to proteins. These proteins are usually ones that are poorly studied or predicted based on genomic sequence data. These predictions are often driven by data-intensive computational procedures. Information may come from nucleic acid sequence homology, gene expression profiles, protein domain structures, text mining of publications, phylogenetic profiles, phenotypic profiles, and protein-protein interaction. Protein function is a broad term: the roles of proteins range from catalysis of biochemical reactions to transport to signal transduction, and a single protein may play a role in multiple processes or cellular pathways.

In molecular biology and genetics, DNA annotation or genome annotation is the process of describing the structure and function of the components of a genome, by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate. Among other things, it identifies the locations of genes and all the coding regions in a genome and determines what those genes do.

Blast2GO, first published in 2005, is a bioinformatics software tool for the automatic, high-throughput functional annotation of novel sequence data. It makes use of the BLAST algorithm to identify similar sequences to then transfers existing functional annotation from yet characterised sequences to the novel one. The functional information is represented via the Gene Ontology (GO), a controlled vocabulary of functional attributes. The Gene Ontology, or GO, is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species.

A putative gene is a segment of DNA that is believed to be a gene. Putative genes can share sequence similarities to already characterized genes and thus can be inferred to share a similar function, yet the exact function of putative genes remains unknown. Newly identified sequences are considered putative gene candidates when homologs of those sequences are found to be associated with the phenotype of interest.

In bioinformatics, a Gene Disease Database is a systematized collection of data, typically structured to model aspects of reality, in a way to comprehend the underlying mechanisms of complex diseases, by understanding multiple composite interactions between phenotype-genotype relationships and gene-disease mechanisms. Gene Disease Databases integrate human gene-disease associations from various expert curated databases and text mining derived associations including Mendelian, complex and environmental diseases.

Model organism databases (MODs) are biological databases, or knowledgebases, dedicated to the provision of in-depth biological data for intensively studied model organisms. MODs allow researchers to easily find background information on large sets of genes, plan experiments efficiently, combine their data with existing knowledge, and construct novel hypotheses. They allow users to analyse results and interpret datasets, and the data they generate are increasingly used to describe less well studied species. Where possible, MODs share common approaches to collect and represent biological information. For example, all MODs use the Gene Ontology (GO) to describe functions, processes and cellular locations of specific gene products. Projects also exist to enable software sharing for curation, visualization and querying between different MODs. Organismal diversity and varying user requirements however mean that MODs are often required to customize capture, display, and provision of data.

Machine learning in bioinformatics is the application of machine learning algorithms to bioinformatics, including genomics, proteomics, microarrays, systems biology, evolution, and text mining.

Metabolic gene clusters or biosynthetic gene clusters are tightly linked sets of mostly non-homologous genes participating in a common, discrete metabolic pathway. The genes are in physical vicinity to each other on the genome, and their expression is often coregulated. Metabolic gene clusters are common features of bacterial and most fungal genomes. They are less often found in other organisms. They are most widely known for producing secondary metabolites, the source or basis of most pharmaceutical compounds, natural toxins, chemical communication, and chemical warfare between organisms. Metabolic gene clusters are also involved in nutrient acquisition, toxin degradation, antimicrobial resistance, and vitamin biosynthesis. Given all these properties of metabolic gene clusters, they play a key role in shaping microbial ecosystems, including microbiome-host interactions. Thus several computational genomics tools have been developed to predict metabolic gene clusters.

Eriko Takano is a professor of synthetic biology and a director of the Synthetic Biology Research Centre for Fine and Speciality Chemicals (SYNBIOCHEM) at the University of Manchester. She develops antibiotics and other high-value chemicals using microbial synthetic biology tools.

Biocuration is the field of life sciences dedicated to organizing biomedical data, information and knowledge into structured formats, such as spreadsheets, tables and knowledge graphs. The biocuration of biomedical knowledge is made possible by the cooperative work of biocurators, software developers and bioinformaticians and is at the base of the work of biological databases.

References

↑ Albarano L, Esposito R, Ruocco N, Costantini M (April 2020). "Genome Mining as New Challenge in Natural Products Discovery". Marine Drugs. 18 (4): 199. doi: 10.3390/md18040199 . PMC 7230286 . PMID 32283638.
↑ Hannigan GD, Prihoda D, Palicka A, Soukup J, Klempir O, Rampula L, et al. (October 2019). "A deep learning genome-mining strategy for biosynthetic gene cluster prediction". Nucleic Acids Research. 47 (18): e110. doi:10.1093/nar/gkz654. PMC 6765103 . PMID 31400112.
↑ Lee N, Hwang S, Kim J, Cho S, Palsson B, Cho BK (2020-01-01). "Mini review: Genome mining approaches for the identification of secondary metabolite biosynthetic gene clusters in Streptomyces". Computational and Structural Biotechnology Journal. 18: 1548–1556. doi:10.1016/j.csbj.2020.06.024. PMC 7327026 . PMID 32637051.
1 2 Challis GL (May 2008). "Genome mining for novel natural product discovery". Journal of Medicinal Chemistry. 51 (9): 2618–2628. doi:10.1021/jm700948z. PMID 18393407.
↑ Bains W, Smith GC (December 1988). "A novel method for nucleic acid sequence determination". Journal of Theoretical Biology. 135 (3): 303–307. Bibcode:1988JThBi.135..303B. doi:10.1016/S0022-5193(88)80246-7. PMID 3256722.
↑ Cook-Deegan R, Heaney C (2010-09-01). "Patents in genomics and human genetics". Annual Review of Genomics and Human Genetics. 11 (1): 383–425. doi:10.1146/annurev-genom-082509-141811. PMC 2935940 . PMID 20590431.
↑ Ziemert N, Alanjary M, Weber T (August 2016). "The evolution of genome mining in microbes - a review". Natural Product Reports. 33 (8): 988–1005. doi: 10.1039/C6NP00025H . PMID 27272205.
↑ Omura S, Ikeda H, Ishikawa J, Hanamoto A, Takahashi C, Shinose M, et al. (October 2001). "Genome sequence of an industrial microorganism Streptomyces avermitilis: deducing the ability of producing secondary metabolites". Proceedings of the National Academy of Sciences of the United States of America. 98 (21): 12215–12220. Bibcode:2001PNAS...9812215O. doi: 10.1073/pnas.211433198 . PMC 59794 . PMID 11572948.
↑ Tang X, Li J, Millán-Aguiñaga N, Zhang JJ, O'Neill EC, Ugalde JA, et al. (December 2015). "Identification of Thiotetronic Acid Antibiotic Biosynthetic Pathways by Target-directed Genome Mining". ACS Chemical Biology. 10 (12): 2841–2849. doi:10.1021/acschembio.5b00658. PMC 4758359 . PMID 26458099.
1 2 Brandon MC, Wallace DC, Baldi P (July 2009). "Data structures and compression algorithms for genomic sequence data". Bioinformatics. 25 (14): 1731–1738. doi:10.1093/bioinformatics/btp319. PMC 2705231 . PMID 19447783.
1 2 "AntiSMASH-DB".
↑ Medema MH, Blin K, Cimermancic P, de Jager V, Zakrzewski P, Fischbach MA, et al. (July 2011). "antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences". Nucleic Acids Research. 39 (Web Server issue): W339–W346. doi:10.1093/nar/gkr466. PMC 3125804 . PMID 21672958.
↑ "PRISM". Adapsyn Bioscience.
↑ Skinnider MA, Johnston CW, Gunabalasingam M, Merwin NJ, Kieliszek AM, MacLellan RJ, et al. (November 2020). "Comprehensive prediction of secondary metabolite structure and biological activity from microbial genome sequences". Nature Communications. 11 (1): 6058. Bibcode:2020NatCo..11.6058S. doi:10.1038/s41467-020-19986-1. PMC 7699628 . PMID 33247171.
↑ King RD, Wise PH, Clare A (May 2004). "Confirmation of data mining based predictions of protein function". Bioinformatics. 20 (7): 1110–1118. doi: 10.1093/bioinformatics/bth047 . PMID 14764546.
↑ Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (October 1990). "Basic local alignment search tool". Journal of Molecular Biology. 215 (3): 403–410. doi:10.1016/S0022-2836(05)80360-2. PMID 2231712.
↑ Medema MH, de Rond T, Moore BS (September 2021). "Mining genomes to illuminate the specialized chemistry of life". Nature Reviews. Genetics. 22 (9): 553–571. doi:10.1038/s41576-021-00363-7. PMC 8364890 . PMID 34083778.
↑ Rutledge PJ, Challis GL (August 2015). "Discovery of microbial natural products by activation of silent biosynthetic gene clusters". Nature Reviews. Microbiology. 13 (8): 509–523. doi:10.1038/nrmicro3496. PMID 26119570. S2CID 6474118.
↑ Belknap KC, Park CJ, Barth BM, Andam CP (February 2020). "Genome mining of biosynthetic and chemotherapeutic gene clusters in Streptomyces bacteria". Scientific Reports. 10 (1): 2003. Bibcode:2020NatSR..10.2003B. doi:10.1038/s41598-020-58904-9. PMC 7005152 . PMID 32029878.
↑ Hoffmeister D, Keller NP (April 2007). "Natural products of filamentous fungi: enzymes, genes, and their regulation". Natural Product Reports. 24 (2): 393–416. doi:10.1039/B603084J. PMID 17390002.
↑ Micallef ML, D'Agostino PM, Sharma D, Viswanathan R, Moffitt MC (September 2015). "Genome mining for natural product biosynthetic gene clusters in the Subsection V cyanobacteria". BMC Genomics. 16 (1): 669. doi: 10.1186/s12864-015-1855-z . PMC 4558948 . PMID 26335778.
↑ Gomez-Escribano JP, Bibb MJ (February 2014). "Heterologous expression of natural product biosynthetic gene clusters in Streptomyces coelicolor: from genome mining to manipulation of biosynthetic pathways". Journal of Industrial Microbiology & Biotechnology. 41 (2): 425–431. doi:10.1007/s10295-013-1348-5. PMID 24096958. S2CID 15215660.
↑ Sayers EW, Cavanaugh M, Clark K, Pruitt KD, Schoch CL, Sherry ST, Karsch-Mizrachi I (January 2021). "GenBank". Nucleic Acids Research. 49 (D1): D92–D96. doi:10.1093/nar/gkaa1023. PMC 7778897 . PMID 33196830.
↑ "IMG-ABC".
↑ Palaniappan K, Chen IA, Chu K, Ratner A, Seshadri R, Kyrpides NC, et al. (January 2020). "IMG-ABC v.5.0: an update to the IMG/Atlas of Biosynthetic Gene Clusters Knowledgebase". Nucleic Acids Research. 48 (D1): D422–D430. doi:10.1093/nar/gkz932. PMC 7145673 . PMID 31665416.
↑ "BIG-FAM".
↑ Kautsar SA, Blin K, Shaw S, Weber T, Medema MH (January 2021). "BiG-FAM: the biosynthetic gene cluster families database". Nucleic Acids Research. 49 (D1): D490–D497. doi:10.1093/nar/gkaa812. PMC 7778980 . PMID 33010170.
↑ "DoBISCUIT".
↑ Ichikawa N, Sasagawa M, Yamamoto M, Komaki H, Yoshida Y, Yamazaki S, Fujita N (January 2013). "DoBISCUIT: a database of secondary metabolite biosynthetic gene clusters". Nucleic Acids Research. 41 (Database issue): D408–D414. doi:10.1093/nar/gks1177. PMC 3531092 . PMID 23185043.
↑ "MIBiG".
↑ Kautsar SA, Blin K, Shaw S, Navarro-Muñoz JC, Terlouw BR, van der Hooft JJ, et al. (January 2020). "MIBiG 2.0: a repository for biosynthetic gene clusters of known function". Nucleic Acids Research. 48 (D1): D454–D458. doi:10.1093/nar/gkz882. PMC 7145714 . PMID 31612915.
↑ "iTOL".
↑ Letunic I, Bork P (July 2016). "Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees". Nucleic Acids Research. 44 (W1): W242–W245. doi:10.1093/nar/gkw290. PMC 4987883 . PMID 27095192.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Albarano L, Esposito R, Ruocco N, Costantini M (April 2020). "Genome Mining as New Challenge in Natural Products Discovery". Marine Drugs. 18 (4): 199. doi: 10.3390/md18040199 . PMC 7230286 . PMID 32283638.

[2] Hannigan GD, Prihoda D, Palicka A, Soukup J, Klempir O, Rampula L, et al. (October 2019). "A deep learning genome-mining strategy for biosynthetic gene cluster prediction". Nucleic Acids Research. 47 (18): e110. doi:10.1093/nar/gkz654. PMC 6765103 . PMID 31400112.

[3] Lee N, Hwang S, Kim J, Cho S, Palsson B, Cho BK (2020-01-01). "Mini review: Genome mining approaches for the identification of secondary metabolite biosynthetic gene clusters in Streptomyces". Computational and Structural Biotechnology Journal. 18: 1548–1556. doi:10.1016/j.csbj.2020.06.024. PMC 7327026 . PMID 32637051.

[:0-4] 1 2 Challis GL (May 2008). "Genome mining for novel natural product discovery". Journal of Medicinal Chemistry. 51 (9): 2618–2628. doi:10.1021/jm700948z. PMID 18393407.

[5] Bains W, Smith GC (December 1988). "A novel method for nucleic acid sequence determination". Journal of Theoretical Biology. 135 (3): 303–307. Bibcode:1988JThBi.135..303B. doi:10.1016/S0022-5193(88)80246-7. PMID 3256722.

[6] Cook-Deegan R, Heaney C (2010-09-01). "Patents in genomics and human genetics". Annual Review of Genomics and Human Genetics. 11 (1): 383–425. doi:10.1146/annurev-genom-082509-141811. PMC 2935940 . PMID 20590431.

[7] Ziemert N, Alanjary M, Weber T (August 2016). "The evolution of genome mining in microbes - a review". Natural Product Reports. 33 (8): 988–1005. doi: 10.1039/C6NP00025H . PMID 27272205.

[8] Omura S, Ikeda H, Ishikawa J, Hanamoto A, Takahashi C, Shinose M, et al. (October 2001). "Genome sequence of an industrial microorganism Streptomyces avermitilis: deducing the ability of producing secondary metabolites". Proceedings of the National Academy of Sciences of the United States of America. 98 (21): 12215–12220. Bibcode:2001PNAS...9812215O. doi: 10.1073/pnas.211433198 . PMC 59794 . PMID 11572948.

[9] Tang X, Li J, Millán-Aguiñaga N, Zhang JJ, O'Neill EC, Ugalde JA, et al. (December 2015). "Identification of Thiotetronic Acid Antibiotic Biosynthetic Pathways by Target-directed Genome Mining". ACS Chemical Biology. 10 (12): 2841–2849. doi:10.1021/acschembio.5b00658. PMC 4758359 . PMID 26458099.

[Data_structures_and_compression_alg-10] 1 2 Brandon MC, Wallace DC, Baldi P (July 2009). "Data structures and compression algorithms for genomic sequence data". Bioinformatics. 25 (14): 1731–1738. doi:10.1093/bioinformatics/btp319. PMC 2705231 . PMID 19447783.

[AntiSMASH-DB-11] 1 2 "AntiSMASH-DB".

[12] Medema MH, Blin K, Cimermancic P, de Jager V, Zakrzewski P, Fischbach MA, et al. (July 2011). "antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences". Nucleic Acids Research. 39 (Web Server issue): W339–W346. doi:10.1093/nar/gkr466. PMC 3125804 . PMID 21672958.

[13] "PRISM". Adapsyn Bioscience.

[14] Skinnider MA, Johnston CW, Gunabalasingam M, Merwin NJ, Kieliszek AM, MacLellan RJ, et al. (November 2020). "Comprehensive prediction of secondary metabolite structure and biological activity from microbial genome sequences". Nature Communications. 11 (1): 6058. Bibcode:2020NatCo..11.6058S. doi:10.1038/s41467-020-19986-1. PMC 7699628 . PMID 33247171.

[15] King RD, Wise PH, Clare A (May 2004). "Confirmation of data mining based predictions of protein function". Bioinformatics. 20 (7): 1110–1118. doi: 10.1093/bioinformatics/bth047 . PMID 14764546.

[16] Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (October 1990). "Basic local alignment search tool". Journal of Molecular Biology. 215 (3): 403–410. doi:10.1016/S0022-2836(05)80360-2. PMID 2231712.

[17] Medema MH, de Rond T, Moore BS (September 2021). "Mining genomes to illuminate the specialized chemistry of life". Nature Reviews. Genetics. 22 (9): 553–571. doi:10.1038/s41576-021-00363-7. PMC 8364890 . PMID 34083778.

[18] Rutledge PJ, Challis GL (August 2015). "Discovery of microbial natural products by activation of silent biosynthetic gene clusters". Nature Reviews. Microbiology. 13 (8): 509–523. doi:10.1038/nrmicro3496. PMID 26119570. S2CID 6474118.

[19] Belknap KC, Park CJ, Barth BM, Andam CP (February 2020). "Genome mining of biosynthetic and chemotherapeutic gene clusters in Streptomyces bacteria". Scientific Reports. 10 (1): 2003. Bibcode:2020NatSR..10.2003B. doi:10.1038/s41598-020-58904-9. PMC 7005152 . PMID 32029878.

[20] Hoffmeister D, Keller NP (April 2007). "Natural products of filamentous fungi: enzymes, genes, and their regulation". Natural Product Reports. 24 (2): 393–416. doi:10.1039/B603084J. PMID 17390002.

[21] Micallef ML, D'Agostino PM, Sharma D, Viswanathan R, Moffitt MC (September 2015). "Genome mining for natural product biosynthetic gene clusters in the Subsection V cyanobacteria". BMC Genomics. 16 (1): 669. doi: 10.1186/s12864-015-1855-z . PMC 4558948 . PMID 26335778.

[22] Gomez-Escribano JP, Bibb MJ (February 2014). "Heterologous expression of natural product biosynthetic gene clusters in Streptomyces coelicolor: from genome mining to manipulation of biosynthetic pathways". Journal of Industrial Microbiology & Biotechnology. 41 (2): 425–431. doi:10.1007/s10295-013-1348-5. PMID 24096958. S2CID 15215660.

[Sayers_2021-23] Sayers EW, Cavanaugh M, Clark K, Pruitt KD, Schoch CL, Sherry ST, Karsch-Mizrachi I (January 2021). "GenBank". Nucleic Acids Research. 49 (D1): D92–D96. doi:10.1093/nar/gkaa1023. PMC 7778897 . PMID 33196830.

[24] "IMG-ABC".

[25] Palaniappan K, Chen IA, Chu K, Ratner A, Seshadri R, Kyrpides NC, et al. (January 2020). "IMG-ABC v.5.0: an update to the IMG/Atlas of Biosynthetic Gene Clusters Knowledgebase". Nucleic Acids Research. 48 (D1): D422–D430. doi:10.1093/nar/gkz932. PMC 7145673 . PMID 31665416.

[26] "BIG-FAM".

[27] Kautsar SA, Blin K, Shaw S, Weber T, Medema MH (January 2021). "BiG-FAM: the biosynthetic gene cluster families database". Nucleic Acids Research. 49 (D1): D490–D497. doi:10.1093/nar/gkaa812. PMC 7778980 . PMID 33010170.

[28] "DoBISCUIT".

[29] Ichikawa N, Sasagawa M, Yamamoto M, Komaki H, Yoshida Y, Yamazaki S, Fujita N (January 2013). "DoBISCUIT: a database of secondary metabolite biosynthetic gene clusters". Nucleic Acids Research. 41 (Database issue): D408–D414. doi:10.1093/nar/gks1177. PMC 3531092 . PMID 23185043.

[30] "MIBiG".

[pmid31612915-31] Kautsar SA, Blin K, Shaw S, Navarro-Muñoz JC, Terlouw BR, van der Hooft JJ, et al. (January 2020). "MIBiG 2.0: a repository for biosynthetic gene clusters of known function". Nucleic Acids Research. 48 (D1): D454–D458. doi:10.1093/nar/gkz882. PMC 7145714 . PMID 31612915.

[32] "iTOL".

[33] Letunic I, Bork P (July 2016). "Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees". Nucleic Acids Research. 44 (W1): W242–W245. doi:10.1093/nar/gkw290. PMC 4987883 . PMID 27095192.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]

[24]

[25]

[26]

[27]

[28]

[29]

[30]

[31]

[32]

[33]