BLAT (bioinformatics)

BLAT
Developer(s) Jim Kent, UCSC
Type Bioinformatics tool
License free for noncommercial use; commercial license available; source available
Website genome.ucsc.edu/cgi-bin/hgBlat

BLAT (BLAST-like alignment tool) is a pairwise sequence alignment algorithm that was developed by Jim Kent at the University of California, Santa Cruz (UCSC) in the early 2000s to assist in the assembly and annotation of the human genome. [1] It was designed primarily to decrease the time needed to align millions of mouse genomic reads and expressed sequence tags against the human genome sequence. The alignment tools of the time could not perform these operations quickly enough to allow regular updates of the human genome assembly. Compared to pre-existing tools, BLAT was ~500 times faster at performing mRNA/DNA alignments and ~50 times faster at protein/protein alignments. [1]

Overview

BLAT is one of several algorithms developed for the analysis and comparison of biological sequences such as DNA, RNA and proteins, with a primary goal of inferring homology in order to discover the biological function of genomic sequences. [2] Unlike the classic Needleman-Wunsch [3] and Smith-Waterman [4] dynamic programming algorithms, it is not guaranteed to find the mathematically optimal alignment between two sequences; rather, it first attempts to rapidly detect short sequences that are likely to be homologous, and then aligns and extends the homologous regions. It is similar to the heuristic BLAST [5] [6] family of algorithms, but each tool approaches the problem of aligning biological sequences quickly and efficiently with different algorithmic techniques. [2] [7]

Uses of BLAT

BLAT can be used to align DNA sequences as well as protein and translated nucleotide (mRNA or DNA) sequences. It is designed to work best on highly similar sequences. The DNA search is most effective for primates and the protein search is effective for land vertebrates. [1] [8] In addition, protein or translated sequence queries are more effective for identifying distant matches and for cross-species analysis than DNA sequence queries. [9] Typical uses of BLAT include the following:

BLAT is designed to find matches between sequences of length at least 40 bases that share ≥95% nucleotide identity or ≥80% translated protein identity. [9] [10]

Process

BLAT is used to find regions in a target genomic database which are similar to a query sequence under examination. The general algorithmic process followed by BLAT is similar to BLAST's in that it first searches for short segments in the database and query sequences which have a certain number of matching elements. These alignment seeds are then extended in both directions of the sequences in order to form high-scoring pairs. [12] However, BLAT uses a different indexing approach from BLAST, which allows it to rapidly scan very large genomic and protein databases for similarities to a query sequence. It does this by keeping an indexed list (hash table) of the target database in memory, which significantly reduces the time required for the comparison of the query sequences with the target database. This index is built by taking the coordinates of all the non-overlapping k-mers (words with k letters) in the target database, except for highly repeated k-mers. BLAT then builds a list of all overlapping k-mers from the query sequence and searches for these in the target database, building up a list of hits where there are matches between the sequences [1] (Figure 1 illustrates this process).

Figure 1: Example showing the creation of non-overlapping k-mers from the target database and overlapping k-mers from the query sequence, for k=3. Coordinates of the database sequences are used to clump the matches into larger alignments (full process not shown).
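The indexing and lookup steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than BLAT's actual implementation: the function names `build_index` and `find_hits`, the `max_occurrences` cutoff for "highly repeated" k-mers, and its default value are all assumptions introduced for the example.

```python
from collections import defaultdict


def build_index(target: str, k: int, max_occurrences: int = 1000) -> dict[str, list[int]]:
    """Map each non-overlapping k-mer of `target` to its start coordinates."""
    index: dict[str, list[int]] = defaultdict(list)
    # Step by k so that consecutive indexed words do not overlap.
    for pos in range(0, len(target) - k + 1, k):
        index[target[pos:pos + k]].append(pos)
    # Drop highly repeated k-mers (the cutoff is a hypothetical parameter).
    return {kmer: starts for kmer, starts in index.items()
            if len(starts) <= max_occurrences}


def find_hits(query: str, index: dict[str, list[int]], k: int) -> list[tuple[int, int]]:
    """Look up every overlapping k-mer of `query`; return (query_pos, target_pos) hits."""
    hits: list[tuple[int, int]] = []
    # Step by 1 so that consecutive query words overlap.
    for q_pos in range(len(query) - k + 1):
        for t_pos in index.get(query[q_pos:q_pos + k], ()):
            hits.append((q_pos, t_pos))
    return hits


index = build_index("ACGTTGCAATGC", k=3)   # indexes ACG, TTG, CAA, TGC
hits = find_hits("GTTGCA", index, k=3)     # looks up GTT, TTG, TGC, GCA
```

Because only every k-th target word is indexed, the table stays small enough to hold in memory, while stepping through the query one letter at a time means a shared region can still be found whatever its offset relative to the indexed words.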

Search stage

There are three different strategies used in order to search for candidate homologous regions:

  1. The first method requires a single perfect match between the query and database sequences, i.e. the two k-mer words are exactly the same. This approach is not considered the most practical, because a small k-mer size is necessary to achieve high sensitivity, but a small k-mer size also increases the number of false positive hits and therefore the time spent in the alignment stage of the algorithm. [1]
  2. The second method allows one mismatch between the two k-mer words (a near-perfect match). This decreases the number of false positives and allows larger k-mer sizes, which are less computationally expensive to handle than those required by the previous method. This method is very effective in identifying small homologous regions. [1]
  3. The third method requires multiple perfect matches that are in close proximity to each other. As Kent shows, [1] this is a very effective technique that can accommodate small insertions and deletions within the homologous regions.

When aligning nucleotides, BLAT uses the third method requiring two perfect word matches of size 11 (11-mers). When aligning proteins, the BLAT version determines the search methodology used: when the client/server version is used, BLAT searches for three perfect 4-mer matches; when the stand-alone version is used, BLAT searches for a single perfect 5-mer between the query and database sequences. [1]
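The "multiple perfect matches in close proximity" criterion can be illustrated with a small sketch that groups word hits lying on nearly the same diagonal (target position minus query position). This is a simplified illustration, not Kent's clumping procedure: the function name `clump_hits` and the parameters `max_gap` and `max_diagonal_shift`, along with their default values, are assumptions chosen for the example.

```python
Hit = tuple[int, int]  # (query_pos, target_pos) of one perfect k-mer match


def clump_hits(hits: list[Hit], min_hits: int = 2, max_gap: int = 100,
               max_diagonal_shift: int = 2) -> list[list[Hit]]:
    """Group hits that are close together on (nearly) the same diagonal.

    A small allowed diagonal shift lets a clump span a short insertion
    or deletion. All thresholds here are hypothetical, not BLAT's values.
    """
    clumps: list[list[Hit]] = []
    for hit in sorted(hits, key=lambda h: (h[1], h[0])):
        q_pos, t_pos = hit
        for clump in clumps:
            last_q, last_t = clump[-1]
            near = 0 < t_pos - last_t <= max_gap
            forward = q_pos > last_q
            shift = abs((t_pos - q_pos) - (last_t - last_q))
            if near and forward and shift <= max_diagonal_shift:
                clump.append(hit)
                break
        else:
            # No existing clump accepts this hit, so it starts a new one.
            clumps.append([hit])
    # Keep only regions supported by the required number of word matches.
    return [clump for clump in clumps if len(clump) >= min_hits]


hits = [(0, 100), (11, 111), (5, 500), (30, 133)]
candidates = clump_hits(hits)  # only (0, 100) and (11, 111) form a clump
```

With `min_hits=2` this mirrors the two-word requirement described for nucleotide searches; an isolated hit such as `(5, 500)` is discarded instead of triggering an expensive alignment.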

BLAT vs. BLAST

Some of the differences between BLAT and BLAST are outlined below:

Program usage

BLAT can be used either as a web-based server-client program or as a stand-alone program. [9]

Server-client

The web-based application of BLAT can be accessed from the UCSC Genome Bioinformatics Site. [8] Building the index is a relatively slow procedure. Therefore, each genome assembly used by the web-based BLAT is associated with a BLAT server, in order to have a pre-computed index available for alignments. These web-based BLAT servers keep the index in memory for users to input their query sequences. [11]

Once the query sequence is uploaded/pasted into the search field, the user can select various parameters such as which species' genome to target (there are currently over 50 species available) and the assembly version of that genome (for example, the human genome has four assemblies to select from), the query type (i.e. whether the sequence relates to DNA, protein etc.) and output settings (i.e. how to sort and visualise the output). The user can then run the search by either submitting the query or using the BLAT "I'm feeling lucky" search. [8]

Bhagwat et al. [9] provide step by step protocols for how to use BLAT to:

Input

BLAT can handle long database sequences; however, it is more effective with short query sequences than with long ones. Kent [1] recommends a maximum query length of 200,000 bases. The UCSC browser limits query sequences to less than 25,000 letters (i.e. nucleotides) for DNA searches and less than 10,000 letters (i.e. amino acids) for protein and translated sequence searches. [8]

Figure 2: Using web-based BLAT to search a target database with a DNA query sequence. The search parameters can be seen above the query sequence

The BLAT Search Genome available on the UCSC website accepts query sequences as text (cut and pasted into the query box) or uploaded as text files. The BLAT Search Genome can accept multiple sequences of the same type at once, up to a maximum of 25. For multiple sequences, the total number of nucleotides must not exceed 50,000 for DNA searches or 25,000 letters for protein or translated sequence searches. An example of searching a target database with a DNA query sequence is shown in Figure 2.
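The size limits quoted above can be checked before submitting a batch. The sketch below simply encodes the numbers stated in this section; the function name `check_submission`, the `LIMITS` table and the use of the key "protein" for both protein and translated searches are assumptions made for the example, and the limits enforced by the live service may differ.

```python
# Limits as stated in the text: per-sequence lengths must be *less than*
# the given value, while batch totals must *not exceed* the given value.
LIMITS = {
    "dna": {"per_sequence": 25_000, "total": 50_000},
    "protein": {"per_sequence": 10_000, "total": 25_000},  # also translated searches
}
MAX_SEQUENCES = 25


def check_submission(sequences: list[str], query_type: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    limits = LIMITS[query_type]
    problems: list[str] = []
    if len(sequences) > MAX_SEQUENCES:
        problems.append(f"{len(sequences)} sequences submitted; maximum is {MAX_SEQUENCES}")
    for number, sequence in enumerate(sequences, start=1):
        if len(sequence) >= limits["per_sequence"]:
            problems.append(
                f"sequence {number} has {len(sequence)} letters; "
                f"must be less than {limits['per_sequence']}"
            )
    total = sum(len(sequence) for sequence in sequences)
    if total > limits["total"]:
        problems.append(f"batch has {total} letters; maximum is {limits['total']}")
    return problems


print(check_submission(["ACGT" * 10, "GGCC" * 5], "dna"))
```

Note that three protein queries of 9,999 letters each would pass the per-sequence check but fail the batch total, which is why both checks are needed.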

Output

A BLAT search returns a list of results sorted in decreasing order of score. The following information is returned: the score of the alignment, the region of the query sequence that matches the database sequence, the size of the query sequence, the level of identity as a percentage of the alignment, and the chromosome and position that the query sequence maps to. [9] Bhagwat et al. [9] describe how the BLAT "Score" and "Identity" measures are calculated.

For each search result, the user is provided with a link to the UCSC Genome Browser so they can visualise the alignment on the chromosome. This is a major benefit of the web-based BLAT over the stand-alone BLAT. The user is able to obtain biological information associated with the alignment, such as information about the gene to which the query may match. [9] The user is also provided with a link to view the alignment of the query sequence with the genome assembly. The matches between the query and genome assembly are blue and the boundaries of the alignments are lighter in colour. These exon boundaries indicate splice sites. [8] [9] The "I'm feeling lucky" search result returns the highest scoring alignment for the first query sequence based on the output sort option selected by the user. [8]

Stand-alone

Stand-alone BLAT is more suitable for batch runs, and more efficient than the web-based BLAT. It is more efficient because it is able to store the genome in memory, unlike the web-based application which only stores the index in memory. [1] [9]

License

Both the source code and precompiled binaries of BLAT are freely available for academic and personal use. Commercial licenses for stand-alone BLAT are distributed by Kent Informatics, Inc.

References

  1. Kent, W James (2002). "BLAT--the BLAST-like alignment tool". Genome Research. 12 (4): 656–664. doi:10.1101/gr.229202. PMC 187518. PMID 11932250.
  2. Imelfort, Michael (2009). Edwards, D; Stajich, J; Hansen, D (eds.). Bioinformatics: Tools and Applications. New York: Springer. pp. 19–20. ISBN 978-0-387-92737-4.
  3. Needleman, SB; Wunsch, CD (1970). "A general method applicable to the search for similarities in the amino acid sequence of two proteins". Journal of Molecular Biology. 48 (3): 443–53. doi:10.1016/0022-2836(70)90057-4. PMID 5420325.
  4. Smith, TF; Waterman, MS (1981). "Identification of common molecular subsequences". Journal of Molecular Biology. 147 (1): 195–7. CiteSeerX 10.1.1.63.2897. doi:10.1016/0022-2836(81)90087-5. PMID 7265238.
  5. Altschul, SF; Gish, W; Miller, W; Myers, EW; Lipman, DJ (1990). "Basic local alignment search tool". Journal of Molecular Biology. 215 (3): 403–10. doi:10.1016/S0022-2836(05)80360-2. PMID 2231712. S2CID 14441902.
  6. Altschul, SF; Madden, TL; Schäffer, AA; Zhang, J; Zhang, Z; Miller, W; Lipman, DJ (1997). "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs". Nucleic Acids Research. 25 (17): 3389–402. doi:10.1093/nar/25.17.3389. PMC 146917. PMID 9254694.
  7. Baxevanis, Andreas D.; Ouellette, B.F. Francis (2001). Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins (2nd ed.). New York: Wiley-Interscience. pp. 187–214. ISBN 978-0-471-22392-4.
  8. UCSC Genome Bioinformatics Site
  9. Bhagwat, Medha; Young, Lynn; Robison, Rex R (March 2012). Using BLAT to find sequence similarity in closely related genomes. 10.8. Vol. 10. pp. 10.8.1–10.8.24. doi:10.1002/0471250953.bi1008s37. ISBN 978-0-471-25095-1. PMC 4101998. PMID 22389010.
  10. Ye, Shui Qing (2008). Bioinformatics: A Practical Approach. London: Chapman & Hall. pp. 11–12. ISBN 978-1-58488-810-9.
  11. Kuhn, RM; Haussler, D; Kent, WJ (2013). "The UCSC genome browser and associated tools". Briefings in Bioinformatics. 14 (2): 144–61. doi:10.1093/bib/bbs038. PMC 3603215. PMID 22908213.
  12. Lobo, Ingrid. "Basic Local Alignment Search Tool (BLAST)". Nature Education. Retrieved 15 October 2013.
  13. Pevsner, J (2009). Bioinformatics and Functional Genomics. New Jersey: John Wiley & Sons, Inc. pp. 166–167. ISBN 978-0-470-08585-1.
  14. "NCBI – GenBank: AACZ03015565.1". Retrieved 12 October 2013.