Binning (metagenomics)

In metagenomics, binning is the process of grouping reads or contigs and assigning them to individual genomes. Binning methods can be based on compositional features, on alignment (similarity), or on both. [1]

Introduction

Metagenomic samples can contain reads from a huge number of organisms. For example, a single gram of soil can contain up to 18,000 different types of organisms, each with its own genome. [2] Metagenomic studies sample DNA from the whole community and make it available as nucleotide sequences of a certain length. In most cases, the incomplete nature of the obtained sequences makes it hard to assemble individual genes, [3] much less to recover the full genome of each organism. Binning techniques therefore represent a "best effort" to assign reads or contigs to particular genomes; a genome recovered in this way is known as a metagenome-assembled genome (MAG). The taxonomy of MAGs can be inferred by placing them into a reference phylogenetic tree using tools such as GTDB-Tk. [4]

The first studies that sampled DNA from multiple organisms used specific genes to assess the diversity and origin of each sample. [5] [6] These marker genes had previously been sequenced from clonal cultures of known organisms, so whenever one of these genes appeared in a read or contig from the metagenomic sample, that read could be assigned to a known species or to the operational taxonomic unit (OTU) of that species. The problem with this method was that only a tiny fraction of the sequences carried a marker gene, leaving most of the data unassigned.

Modern binning techniques use both previously available information that is independent of the sample and intrinsic information present in the sample. Depending on the diversity and complexity of the sample, their degree of success varies: in some cases they can resolve the sequences down to individual species, while in others the sequences can at best be assigned to very broad taxonomic groups. [7]

Binning of metagenomic data from various habitats might significantly extend the tree of life. One such approach, applied to globally available metagenomes, binned 52,515 individual microbial genomes and extended the known diversity of bacteria and archaea by 44%. [8]

Algorithms

Binning algorithms can employ previous information, and thus act as supervised classifiers, or they can try to find new groups, in which case they act as unsupervised classifiers. Many do both. The classifiers exploit previously known sequences by performing alignments against databases, and try to separate sequences based on organism-specific characteristics of the DNA, [9] such as GC-content.
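As a minimal illustration of a compositional feature, the sketch below computes GC-content for a few sequences and splits them into two groups. The function names, the toy contigs and the 0.5 threshold are assumptions made for illustration; they are not taken from any of the tools discussed here, and real binners combine many more features.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else 0.0


def split_by_gc(contigs: dict, threshold: float = 0.5) -> dict:
    """Naively place each contig in a 'low_gc' or 'high_gc' bin (illustrative only)."""
    bins = {"low_gc": [], "high_gc": []}
    for name, seq in contigs.items():
        key = "high_gc" if gc_content(seq) >= threshold else "low_gc"
        bins[key].append(name)
    return bins


# Hypothetical toy contigs; real contigs are typically thousands of bases long.
toy_contigs = {
    "contig_1": "ATATTTAAATATTA",
    "contig_2": "GCGCCGGGCGCATC",
    "contig_3": "ATTTAATATAAGTA",
}
toy_bins = split_by_gc(toy_contigs)
print(toy_bins)
```

A single fixed threshold is rarely sufficient in practice, since unrelated organisms can share similar GC-content; it serves here only to show how a compositional signal can separate sequences without any reference database.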

Some prominent binning algorithms for metagenomic datasets obtained through shotgun sequencing include TETRA, MEGAN, Phylopythia, SOrt-ITEMS, and DiScRIBinATE, among others. [10]

TETRA

TETRA is a statistical classifier that uses tetranucleotide usage patterns in genomic fragments. [11] There are four possible nucleotides in DNA, so there are 4^4 = 256 different fragments of four consecutive nucleotides; these fragments are called tetramers. TETRA works by tabulating the frequency of each tetramer in a given sequence. From these frequencies, z-scores are calculated, which indicate how over- or under-represented each tetramer is compared with what would be expected from the individual nucleotide composition. The z-scores for all tetramers are assembled into a vector, and the vectors corresponding to different sequences are compared pairwise to yield a measure of how similar the sequences in the sample are. The most similar sequences are expected to belong to organisms in the same OTU.
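The steps described above can be sketched as follows. This is a simplified illustration, not the TETRA implementation: it assumes that the expected count of each tetramer comes from the individual nucleotide frequencies (as in the description above), uses a binomial approximation for the variance, and compares z-score vectors with a Pearson correlation. The published tool's exact statistics may differ, and all function names and toy sequences are hypothetical.

```python
from itertools import product
from math import sqrt

# All 4**4 = 256 possible tetramers, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_zscores(seq: str) -> list:
    """Z-score per tetramer; assumes seq contains only A, C, G and T."""
    seq = seq.upper()
    if len(seq) < 4:
        raise ValueError("sequence must contain at least four bases")
    n_windows = len(seq) - 3
    base_freq = {b: seq.count(b) / len(seq) for b in "ACGT"}

    # Tabulate observed tetramer counts with a sliding window.
    counts = dict.fromkeys(TETRAMERS, 0)
    for i in range(n_windows):
        word = seq[i:i + 4]
        if word in counts:
            counts[word] += 1

    zscores = []
    for word in TETRAMERS:
        p = 1.0
        for base in word:
            p *= base_freq[base]          # probability under independence
        expected = n_windows * p
        variance = n_windows * p * (1.0 - p)  # binomial approximation
        z = (counts[word] - expected) / sqrt(variance) if variance > 0 else 0.0
        zscores.append(z)
    return zscores


def pearson(x: list, y: list) -> float:
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


# Hypothetical toy fragments; real fragments would be far longer.
frag_a = "ACGTTGCAAGCTTAGCCGATAGCTAGGCTAACGT" * 3
frag_b = "TTATAAATATTTAGATATAAATTTATACATATTA" * 3
similarity = pearson(tetra_zscores(frag_a), tetra_zscores(frag_b))
print(f"correlation between fragments: {similarity:.3f}")
```

Pairs of fragments with a high correlation would be candidates for the same bin; the cutoff for "high" is a tuning choice that this sketch does not attempt to set.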

MEGAN

In the DIAMOND [12] + MEGAN [13] approach, all reads are first aligned against a protein reference database, such as NCBI-nr, and the resulting alignments are then analyzed using the naive LCA (lowest common ancestor) algorithm, which places a read on the lowest taxonomic node in the NCBI taxonomy that lies above all taxa to which the read has a significant alignment. Here, an alignment is usually deemed "significant" if its bit score lies above a given threshold (which depends on the length of the reads) and is within, say, 10% of the best score seen for that read. The rationale for using protein reference sequences, rather than DNA reference sequences, is that current DNA reference databases cover only a small fraction of the true diversity of genomes that exist in the environment.
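The naive LCA rule described above can be sketched as follows. This is an illustration under stated assumptions, not MEGAN's code: the tiny taxonomy is a hypothetical, heavily simplified tree that skips most ranks, and the function names, the `min_score` default of 50 and the `top_percent` default of 10 are chosen for the example only.

```python
# Hypothetical, heavily simplified taxonomy: child -> parent (most ranks omitted).
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Escherichia": "Proteobacteria",
    "Escherichia coli": "Escherichia",
    "Escherichia albertii": "Escherichia",
    "Salmonella": "Proteobacteria",
    "Salmonella enterica": "Salmonella",
    "Firmicutes": "Bacteria",
    "Bacillus subtilis": "Firmicutes",
}


def lineage(taxon: str) -> list:
    """Path from the root down to the given taxon."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path[::-1]


def lowest_common_ancestor(taxa) -> str:
    """Deepest node shared by the lineages of all given taxa."""
    lca = None
    for level in zip(*(lineage(t) for t in taxa)):
        if len(set(level)) != 1:
            break
        lca = level[0]
    return lca


def assign_read(hits, min_score: float = 50.0, top_percent: float = 10.0):
    """hits: list of (taxon, bit_score) pairs for one read; returns a taxon or None."""
    passing = [(taxon, score) for taxon, score in hits if score >= min_score]
    if not passing:
        return None  # no significant alignment: the read stays unassigned
    best = max(score for _, score in passing)
    cutoff = best * (1.0 - top_percent / 100.0)
    kept = {taxon for taxon, score in passing if score >= cutoff}
    return lowest_common_ancestor(kept)


# Two near-equal hits in the same genus: the read is placed on the genus.
print(assign_read([("Escherichia coli", 200), ("Escherichia albertii", 195)]))
# The second hit falls outside the top 10% and is ignored.
print(assign_read([("Escherichia coli", 200), ("Bacillus subtilis", 120)]))
```

The design trades resolution for caution: a read whose good hits span distant taxa is pushed up to a broad node rather than assigned to a single species.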

Phylopythia

Phylopythia is a supervised classifier developed by researchers at IBM labs; it is essentially a support vector machine trained on DNA k-mers from known sequences. [6]

SOrt-ITEMS

SOrt-ITEMS [14] is an alignment-based binning algorithm developed by the Innovation Labs of Tata Consultancy Services (TCS) Ltd., India. Users first perform a similarity search of the input metagenomic sequences (reads) against the nr protein database using a BLASTx search. The generated BLASTx output is then taken as input by the SOrt-ITEMS program. The method uses a range of BLAST alignment parameter thresholds to first identify an appropriate taxonomic level (or rank) at which the read can be assigned. An orthology-based approach is then adopted for the final assignment of the metagenomic read. Other alignment-based binning algorithms developed by the Innovation Labs of Tata Consultancy Services (TCS) include DiScRIBinATE, [15] ProViDE [16] and SPHINX. [17] The methodologies of these algorithms are summarized below.

DiScRIBinATE

DiScRIBinATE [15] is an alignment-based binning algorithm developed by the Innovation Labs of Tata Consultancy Services (TCS) Ltd., India. DiScRIBinATE replaces the orthology approach of SOrt-ITEMS with a quicker 'alignment-free' approach. Incorporating this alternative strategy was observed to halve the binning time without any significant loss in the accuracy and specificity of assignments. In addition, a novel reclassification strategy incorporated in DiScRIBinATE was seen to reduce the overall misclassification rate.

ProViDE

ProViDE [16] is an alignment-based binning approach developed by the Innovation Labs of Tata Consultancy Services (TCS) Ltd. for the estimation of viral diversity in metagenomic samples. ProViDE adopts a reverse-orthology-based approach similar to that of SOrt-ITEMS for the taxonomic classification of metagenomic sequences obtained from virome datasets. It uses a customized set of BLAST parameter thresholds specifically suited to viral metagenomic sequences. These thresholds capture the pattern of sequence divergence and the non-uniform taxonomic hierarchy observed within and across various taxonomic groups of the viral kingdom.

PCAHIER

PCAHIER, [18] a binning algorithm developed at the Georgia Institute of Technology, employs n-mer oligonucleotide frequencies as features and adopts a hierarchical classifier for binning short metagenomic fragments. Principal component analysis is used to reduce the high dimensionality of the feature space. The effectiveness of PCAHIER was demonstrated through comparisons against a non-hierarchical classifier and two existing binning algorithms (TETRA and Phylopythia).

SPHINX

SPHINX, [17] another binning algorithm developed by the Innovation Labs of Tata Consultancy Services (TCS) Ltd., adopts a hybrid strategy that achieves high binning efficiency by utilizing the principles of both 'composition'- and 'alignment'-based binning algorithms. The approach was designed with the objective of analyzing metagenomic datasets as rapidly as composition-based approaches, but nevertheless with the accuracy and specificity of alignment-based algorithms. SPHINX was observed to classify metagenomic sequences as rapidly as composition-based algorithms. In addition, the binning efficiency (in terms of accuracy and specificity of assignments) of SPHINX was observed to be comparable with results obtained using alignment-based algorithms.

INDUS and TWARIT

INDUS [19] and TWARIT [20] are further composition-based binning algorithms developed by the Innovation Labs of Tata Consultancy Services (TCS) Ltd. These algorithms use a range of oligonucleotide compositional (as well as statistical) parameters to improve binning time while maintaining the accuracy and specificity of taxonomic assignments.

References

  1. Maguire, Finlay; Jia, Baofeng; Gray, Kristen L.; Lau, Wing Yin Venus; Beiko, Robert G.; Brinkman, Fiona S. L. (2020-10-01). "Metagenome-assembled genome binning methods with short reads disproportionately fail for plasmids and genomic Islands". Microbial Genomics. 6 (10): mgen000436. doi: 10.1099/mgen.0.000436 . ISSN   2057-5858. PMC   7660262 . PMID   33001022.
  2. Daniel, Rolf (2005-06-01). "The metagenomics of soil". Nature Reviews Microbiology. 3 (6): 470–478. doi:10.1038/nrmicro1160. ISSN   1740-1526. PMID   15931165. S2CID   32604394.
  3. Wooley, John C.; Godzik, Adam; Friedberg, Iddo (2010-02-26). "A Primer on Metagenomics". PLOS Comput Biol. 6 (2): e1000667. Bibcode:2010PLSCB...6E0667W. doi: 10.1371/journal.pcbi.1000667 . PMC   2829047 . PMID   20195499.
  4. Chaumeil, Pierre-Alain; Mussig, Aaron J; Hugenholtz, Philip; Parks, Donovan H (2019-11-15). Hancock, John (ed.). "GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database". Bioinformatics. 36 (6): 1925–1927. doi:10.1093/bioinformatics/btz848. ISSN   1367-4803. PMC   7703759 . PMID   31730192.
  5. Giovannoni, Stephen J.; Britschgi, Theresa B.; Moyer, Craig L.; Field, Katharine G. (1990-05-03). "Genetic diversity in Sargasso Sea bacterioplankton". Nature. 345 (6270): 60–63. Bibcode:1990Natur.345...60G. doi:10.1038/345060a0. PMID   2330053. S2CID   4370502.
  6. McHardy, Alice Carolyn; Martin, Hector Garcia; Tsirigos, Aristotelis; Hugenholtz, Philip; Rigoutsos, Isidore (January 2007). "Accurate phylogenetic classification of variable-length DNA fragments". Nature Methods. 4 (1): 63–72. doi:10.1038/nmeth976. ISSN   1548-7091. PMID   17179938. S2CID   28797816.
  7. Hickl, Oskar; Queirós, Pedro; Wilmes, Paul; May, Patrick; Heintz-Buschart, Anna (19 November 2022). "binny: an automated binning algorithm to recover high-quality genomes from complex metagenomic datasets". Briefings in Bioinformatics. 23 (6). doi:10.1093/bib/bbac431.
  8. IMG/M Data Consortium; Nayfach, Stephen; Roux, Simon; Seshadri, Rekha; Udwary, Daniel; Varghese, Neha; Schulz, Frederik; Wu, Dongying; Paez-Espino, David; Chen, I-Min; Huntemann, Marcel (2020-11-09). "A genomic catalog of Earth's microbiomes". Nature Biotechnology. 39 (4): 499–509. doi: 10.1038/s41587-020-0718-6 . ISSN   1087-0156. PMC   8041624 . PMID   33169036.
  9. Karlin, S.; I. Ladunga; B. E. Blaisdell (1994). "Heterogeneity of genomes: measures and values". Proceedings of the National Academy of Sciences. 91 (26): 12837–12841. Bibcode:1994PNAS...9112837K. doi: 10.1073/pnas.91.26.12837 . PMC   45535 . PMID   7809131.
  10. Mande, Sharmila S.; Mohammed, Monzoorul Haque; Ghosh, Tarini Shankar (1 November 2012). "Classification of metagenomic sequences: methods and challenges". Briefings in Bioinformatics. 13 (6): 669–681. doi:10.1093/bib/bbs054. PMID   22962338.
  11. Teeling, Hanno; Waldmann, Jost; Lombardot, Thierry; Bauer, Margarete; Glockner, Frank (2004). "TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences". BMC Bioinformatics. 5 (1): 163. doi: 10.1186/1471-2105-5-163 . PMC   529438 . PMID   15507136.
  12. Buchfink, Benjamin; Xie, Chao; Huson, Daniel H (January 2015). "Fast and sensitive protein alignment using DIAMOND". Nature Methods. 12 (1): 59–60. doi:10.1038/nmeth.3176. PMID   25402007. S2CID   5346781.
  13. Huson, Daniel H.; Beier, Sina; Flade, Isabell; Górska, Anna; El-Hadidi, Mohamed; Mitra, Suparna; Ruscheweyh, Hans-Joachim; Tappu, Rewati (21 June 2016). "MEGAN Community Edition - Interactive Exploration and Analysis of Large-Scale Microbiome Sequencing Data". PLOS Computational Biology. 12 (6): e1004957. Bibcode:2016PLSCB..12E4957H. doi: 10.1371/journal.pcbi.1004957 . PMC   4915700 . PMID   27327495.
  14. Monzoorul Haque, M.; Ghosh, Tarini Shankar; Komanduri, Dinakar; Mande, Sharmila S. (15 July 2009). "SOrt-ITEMS: Sequence orthology based approach for improved taxonomic estimation of metagenomic sequences". Bioinformatics. 25 (14): 1722–1730. doi:10.1093/bioinformatics/btp317. PMID   19439565.
  15. Ghosh, Tarini Shankar; Haque M, Monzoorul; Mande, Sharmila S (October 2010). "DiScRIBinATE: a rapid method for accurate taxonomic classification of metagenomic sequences". BMC Bioinformatics. 11 (S7). doi: 10.1186/1471-2105-11-s7-s14 . PMC   2957682 . PMID   21106121.
  16. Ghosh, Tarini Shankar; Mohammed, Monzoorul Haque; Komanduri, Dinakar; Mande, Sharmila Shekhar (22 March 2011). "ProViDE: A software tool for accurate estimation of viral diversity in metagenomic samples". Bioinformation. 6 (2): 91–94. doi:10.6026/97320630006091. PMC   3082859 . PMID   21544173.
  17. Mohammed, Monzoorul Haque; Ghosh, Tarini Shankar; Singh, Nitin Kumar; Mande, Sharmila S. (1 January 2011). "SPHINX—an algorithm for taxonomic binning of metagenomic sequences". Bioinformatics. 27 (1): 22–30. doi:10.1093/bioinformatics/btq608. PMID   21030462.
  18. Zheng, Hao; Wu, Hongwei (December 2010). "Short prokaryotic DNA fragment binning using a hierarchical classifier based on linear discriminant analysis and principal component analysis". Journal of Bioinformatics and Computational Biology. 08 (06): 995–1011. doi:10.1142/s0219720010005051. PMID   21121023.
  19. Mohammed, Monzoorul Haque; Ghosh, Tarini Shankar; Reddy, Rachamalla Maheedhar; Reddy, Chennareddy Venkata Siva Kumar; Singh, Nitin Kumar; Mande, Sharmila S (December 2011). "INDUS - a composition-based approach for rapid and accurate taxonomic classification of metagenomic sequences". BMC Genomics. 12 (S3). doi: 10.1186/1471-2164-12-s3-s4 . PMC   3333187 . PMID   22369237.
  20. Reddy, Rachamalla Maheedhar; Mohammed, Monzoorul Haque; Mande, Sharmila S (September 2012). "TWARIT: An extremely rapid and efficient approach for phylogenetic classification of metagenomic sequences". Gene. 505 (2): 259–265. doi:10.1016/j.gene.2012.06.014. PMID   22710135.