Open reading frame

Sample sequence showing three different possible reading frames. Start codons are highlighted in purple, and stop codons are highlighted in red.

In molecular biology, open reading frames (ORFs) are defined as spans of DNA sequence between the start and stop codons. Usually, this is considered within a studied region of a prokaryotic DNA sequence, where only one of the six possible reading frames will be "open" (the "reading", however, refers to the RNA produced by transcription of the DNA and its subsequent interaction with the ribosome in translation). Such an ORF may [1] contain a start codon (usually AUG in terms of RNA) and by definition cannot extend beyond a stop codon (usually UAA, UAG or UGA in RNA). [2] That start codon (not necessarily the first) indicates where translation may start. The transcription termination site is located after the ORF, beyond the translation stop codon. If transcription were to cease before the stop codon, an incomplete protein would be made during translation. [3]
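
The start-to-stop definition above can be illustrated with a minimal sketch. It assumes a DNA string given in the 5' to 3' direction, the standard stop codons written in DNA letters (TAA, TAG, TGA) and ATG as the only start codon; the function name `find_orfs` and its parameters are hypothetical and do not come from any of the tools described in this article.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA equivalents of UAA, UAG, UGA
START_CODON = "ATG"                  # DNA equivalent of AUG


def find_orfs(dna, min_codons=1):
    """Return (start, end, frame) tuples for ATG...stop spans on one strand.

    start is inclusive and end is exclusive; the span includes the stop codon.
    min_codons counts codons before the stop codon (start codon included).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first start codon seen since the last stop
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


print(find_orfs("CCATGAAATAGGG"))
```

Only the first start codon after a stop is recorded here, so later in-frame ATG codons are treated as internal methionines; a real gene finder would usually consider them as alternative starts.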

In eukaryotic genes with multiple exons, introns are removed and exons are then joined together after transcription to yield the final mRNA for protein translation. In the context of gene finding, the start-stop definition of an ORF therefore only applies to spliced mRNAs, not genomic DNA, since introns may contain stop codons and/or cause shifts between reading frames. An alternative definition says that an ORF is a sequence that has a length divisible by three and is bounded by stop codons. [1] [4] This more general definition can be useful in the context of transcriptomics and metagenomics, where a start or stop codon may not be present in the obtained sequences. Such an ORF corresponds to parts of a gene rather than the complete gene.
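
As a rough sketch of the alternative, stop-to-stop definition, the following function splits one reading frame at its stop codons and returns the stretches in between. The name `stop_to_stop_orfs` is hypothetical, and the handling of sequence ends is an assumption: stretches that touch either end of the input are returned as well, since in fragmentary transcriptome or metagenome data they may be partial ORFs.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_to_stop_orfs(dna, frame=0):
    """Return (start, end) spans free of stop codons in a single frame.

    Every span has a length divisible by three. Spans that begin at the
    frame offset or end at the last complete codon are bounded by the
    sequence edge rather than by a stop codon.
    """
    dna = dna.upper()
    regions = []
    seg_start = frame  # start of the current stop-free stretch
    last = frame       # end of the last complete codon read so far
    for i in range(frame, len(dna) - 2, 3):
        last = i + 3
        if dna[i:i + 3] in STOP_CODONS:
            if i > seg_start:
                regions.append((seg_start, i))
            seg_start = i + 3
    if last > seg_start:
        regions.append((seg_start, last))
    return regions


print(stop_to_stop_orfs("AAATAGCCCGGGTGAAA"))
```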

Biological significance

One common use of open reading frames (ORFs) is as one piece of evidence to assist in gene prediction. Long ORFs are often used, along with other evidence, to initially identify candidate protein-coding regions or functional RNA-coding regions in a DNA sequence. [5] The presence of an ORF does not necessarily mean that the region is always translated. For example, in a randomly generated DNA sequence with an equal percentage of each nucleotide, a stop-codon would be expected once every 21 codons. [5] A simple gene prediction algorithm for prokaryotes might look for a start codon followed by an open reading frame that is long enough to encode a typical protein, where the codon usage of that region matches the frequency characteristic for the given organism's coding regions. [5] Therefore, some authors say that an ORF should have a minimal length, e.g. 100 codons [6] or 150 codons. [5] By itself even a long open reading frame is not conclusive evidence for the presence of a gene. [5]
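
The figure of roughly one stop codon every 21 codons follows from simple counting: 3 of the 64 possible codons are stop codons. The sketch below reproduces that arithmetic and estimates how unlikely a long stop-free run is by chance. It assumes the standard genetic code and independent, equally frequent nucleotides, which real genomes do not satisfy; the function name `prob_open_run` is hypothetical.

```python
from fractions import Fraction
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}

# Enumerate all 64 codons and count the fraction that are stop codons.
all_codons = ["".join(c) for c in product("ACGT", repeat=3)]
p_stop = Fraction(sum(c in STOP_CODONS for c in all_codons), len(all_codons))

# Mean spacing between stop codons in a random sequence (64/3 codons).
expected_spacing = 1 / p_stop


def prob_open_run(n_codons):
    """Probability that n_codons consecutive random codons contain no stop."""
    return float((1 - p_stop) ** n_codons)


print(p_stop, float(expected_spacing))
print(prob_open_run(100), prob_open_run(150))
```

Under these assumptions a stop-free run of 100 codons arises by chance in fewer than 1% of positions, which is the reasoning behind minimum-length thresholds of 100 or 150 codons.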

Short ORFs (sORFs)

Some short ORFs (sORFs), also called small open reading frames, [7] are usually fewer than 100 codons long [8] and lack the classical hallmarks of protein-coding genes. Such sORFs, found in both ncRNAs and mRNAs, can produce functional peptides. [9] The 5′-UTRs of about 50% of mammalian mRNAs are known to contain one or several sORFs, [10] also called upstream ORFs or uORFs. However, less than 10% of the vertebrate mRNAs surveyed in an older study contained AUG codons in front of the major ORF, whereas uORFs were found in two thirds of proto-oncogenes and related proteins. [11] Between 64% and 75% of experimentally found translation initiation sites of sORFs are conserved in the genomes of human and mouse, which may indicate that these elements have a function. [12] However, sORFs can often be found only in the minor forms of mRNAs and avoid selection; the high conservation of initiation sites may be connected with their location inside promoters of the relevant genes. This is characteristic of the SLAMF1 gene, for example. [13]

Six-frame translation

Since DNA is interpreted in groups of three nucleotides (codons), a DNA strand has three distinct reading frames. [14] The double helix of a DNA molecule has two anti-parallel strands; with the two strands having three reading frames each, there are six possible frame translations. [14]
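
A six-frame translation can be sketched in a few lines. The example assumes the standard genetic code, an input restricted to the letters A, C, G and T, and a frame labelling of +1 to +3 for the given strand and -1 to -3 for its reverse complement; the function names are hypothetical, and incomplete trailing codons are simply dropped.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate complete codons; '*' marks a stop, 'X' an unknown codon."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(dna):
    """Return a dict mapping frame labels to translated peptide strings."""
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse_complement[offset:])
    return frames


for label, peptide in sorted(six_frame_translation("ATGGCCTAA").items()):
    print(label, peptide)
```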

Example of a six-frame translation. The nucleotide sequence is shown in the middle with forward translations above and reverse translations below. Two possible open reading frames with the sequences are highlighted.

Software

Finder

The ORF Finder (Open Reading Frame Finder) [15] is a graphical analysis tool which finds all open reading frames of a selectable minimum size in a user's sequence or in a sequence already in the database. This tool identifies all open reading frames using the standard or alternative genetic codes. The deduced amino acid sequence can be saved in various formats and searched against the sequence database using the basic local alignment search tool (BLAST) server. The ORF Finder should be helpful in preparing complete and accurate sequence submissions. It is also packaged with the Sequin sequence submission software (sequence analyser).

Investigator

ORF Investigator [16] is a program that not only gives information about the coding and non-coding sequences but can also perform pairwise global alignment of sequences from different genes or DNA regions. The tool finds the ORFs for corresponding amino acid sequences, converts them into their single-letter amino acid code, and reports their locations in the sequence. The pairwise global alignment between the sequences makes it convenient to detect different mutations, including single-nucleotide polymorphisms. The Needleman–Wunsch algorithm is used for the gene alignment. ORF Investigator is written in the portable Perl programming language and is therefore available to users of all common operating systems.

Predictor

OrfPredictor [17] is a web server designed for identifying protein-coding regions in expressed sequence tag (EST)-derived sequences. For query sequences with a hit in BLASTX, the program predicts the coding regions based on the translation reading frames identified in BLASTX alignments; otherwise, it predicts the most probable coding region based on the intrinsic signals of the query sequences. The output is the predicted peptide sequences in FASTA format, with a definition line that includes the query ID, the translation reading frame, and the nucleotide positions where the coding region begins and ends. OrfPredictor facilitates the annotation of EST-derived sequences, particularly for large-scale EST projects.

OrfPredictor uses a combination of the two different ORF definitions mentioned above. It searches for stretches starting with a start codon and ending at a stop codon. As an additional criterion, it searches for a stop codon in the 5' untranslated region (UTR or NTR, nontranslated region [18]).

ORFik

ORFik is an R package in Bioconductor for finding open reading frames and for using next-generation sequencing data to justify ORFs. [19] [20]

orfipy

orfipy is a tool written in Python and Cython to extract ORFs in an extremely fast and flexible manner. [21] orfipy can work with plain or gzipped FASTA and FASTQ sequences, and provides several options to fine-tune ORF searches; these include specifying the start and stop codons, reporting partial ORFs, and using custom translation tables. The results can be saved in multiple formats, including the space-efficient BED format. orfipy is particularly fast for data containing many smaller FASTA sequences, such as de novo transcriptome assemblies. [22]

References

  1. Sieber P, Platzer M, Schuster S (March 2018). "The Definition of Open Reading Frame Revisited". Trends in Genetics. 34 (3): 167–170. doi:10.1016/j.tig.2017.12.009. PMID 29366605.
  2. Brody LC (2021-08-25). "Stop Codon". National Human Genome Research Institute. National Institutes of Health. Retrieved 2021-08-25.
  3. Slonczewski J, Foster JW (2009). Microbiology: An Evolving Science. New York: W.W. Norton & Co. ISBN 978-0-393-97857-5. OCLC 185042615.
  4. Claverie JM (1997). "Computational methods for the identification of genes in vertebrate genomic sequences". Human Molecular Genetics. 6 (10): 1735–44. doi:10.1093/hmg/6.10.1735. PMID 9300666.
  5. Deonier R, Tavaré S, Waterman M (2005). Computational Genome Analysis: an introduction. Springer-Verlag. p. 25. ISBN 978-0-387-98785-9.
  6. Claverie JM, Poirot O, Lopez F (1997). "The difficulty of identifying genes in anonymous vertebrate sequences". Computers & Chemistry. 21 (4): 203–14. doi:10.1016/s0097-8485(96)00039-3. PMID 9415985.
  7. Vakirlis, Nikolaos; Vance, Zoe; Duggan, Kate M.; McLysaght, Aoife (2022). "De novo birth of functional microproteins in the human lineage". Cell Reports. 41 (12): 111808. doi:10.1016/j.celrep.2022.111808. PMC 10073203. PMID 36543139. S2CID 254966620.
  8. Kute, Preeti Madhav; Soukarieh, Omar; Tjeldnes, Håkon; Trégouët, David-Alexandre; Valen, Eivind (2022). "Small Open Reading Frames, How to Find Them and Determine Their Function". Frontiers in Genetics. 12: 796060. doi:10.3389/fgene.2021.796060. PMC 8831751. PMID 35154250.
  9. Zanet J, Benrabah E, Li T, Pélissier-Monier A, Chanut-Delalande H, Ronsin B, et al. (September 2015). "Pri sORF peptides induce selective proteasome-mediated protein processing". Science. 349 (6254): 1356–1358. Bibcode:2015Sci...349.1356Z. doi:10.1126/science.aac5677. PMID 26383956. S2CID 206639549.
  10. Wethmar K, Barbosa-Silva A, Andrade-Navarro MA, Leutz A (January 2014). "uORFdb--a comprehensive literature database on eukaryotic uORF biology". Nucleic Acids Research. 42 (Database issue): D60–D67. doi:10.1093/nar/gkt952. PMC 3964959. PMID 24163100.
  11. Geballe, A. P.; Morris, D. R. (April 1994). "Initiation codons within 5'-leaders of mRNAs as regulators of translation". Trends in Biochemical Sciences. 19 (4): 159–164. doi:10.1016/0968-0004(94)90277-1. ISSN 0968-0004. PMID 8016865.
  12. Lee S, Liu B, Lee S, Huang SX, Shen B, Qian SB (September 2012). "Global mapping of translation initiation sites in mammalian cells at single-nucleotide resolution". Proceedings of the National Academy of Sciences of the United States of America. 109 (37): E2424–E2432. doi:10.1073/pnas.1207846109. PMC 3443142. PMID 22927429.
  13. Schwartz AM, Putlyaeva LV, Covich M, Klepikova AV, Akulich KA, Vorontsov IE, et al. (October 2016). "Early B-cell factor 1 (EBF1) is critical for transcriptional control of SLAMF1 gene in human B cells". Biochimica et Biophysica Acta (BBA) - Gene Regulatory Mechanisms. 1859 (10): 1259–1268. doi:10.1016/j.bbagrm.2016.07.004. PMID 27424222.
  14. Pearson WR, Wood T, Zhang Z, Miller W (November 1997). "Comparison of DNA sequences with protein sequences". Genomics. 46 (1): 24–36. doi:10.1006/geno.1997.4995. PMID 9403055. S2CID 6413018.
  15. "ORFfinder". National Center for Biotechnology Information.
  16. Dhar DV, Kumar MS (2012). "ORF Investigator: A New ORF finding tool combining Pairwise Global Gene Alignment". Research Journal of Recent Sciences. 1 (11): 32–35.
  17. "OrfPredictor". bioinformatics.ysu.edu. Archived from the original on 2015-12-22. Retrieved 2015-12-17.
  18. Carrington JC, Freed DD (April 1990). "Cap-independent enhancement of translation by a plant potyvirus 5' nontranslated region". Journal of Virology. 64 (4): 1590–7. doi:10.1128/JVI.64.4.1590-1597.1990. PMC 249294. PMID 2319646.
  19. Kornel Labun, Haakon Tjeldnes (2018). "ORFik - Open reading frames in genomics". bioconductor.org. doi:10.18129/B9.bioc.ORFik.
  20. Tjeldnes, Håkon; Labun, Kornel; Torres Cleuren, Yamila; Chyżyńska, Katarzyna; Świrski, Michał; Valen, Eivind (2021). "ORFik: A comprehensive R toolkit for the analysis of translation". BMC Bioinformatics. 22 (1): 336. doi:10.1186/s12859-021-04254-w. PMC 8214792. PMID 34147079.
  21. Singh U, Wurtele ES (February 2021). "orfipy: a fast and flexible tool for extracting ORFs". Bioinformatics. 37 (18): 3019–3020. doi:10.1093/bioinformatics/btab090. ISSN 1367-4803. PMC 8479652. PMID 33576786.
  22. Singh U (2021-02-13), urmi-21/orfipy, retrieved 2021-02-13