GC-content

Nucleotide bonds showing AT and GC pairs. Arrows point to the hydrogen bonds.

In molecular biology and genetics, GC-content (or guanine-cytosine content) is the percentage of nitrogenous bases in a DNA or RNA molecule that are either guanine (G) or cytosine (C). [1] The measure gives the proportion of G and C bases among all four bases, the other two being adenine and thymine in DNA or adenine and uracil in RNA.

GC-content may be given for a certain fragment of DNA or RNA or for an entire genome. When it refers to a fragment, it may denote the GC-content of an individual gene or section of a gene (domain), a group of genes or gene clusters, a non-coding region, or a synthetic oligonucleotide such as a primer.

Structure

Qualitatively, guanine (G) and cytosine (C) pair specifically with each other through hydrogen bonds, whereas adenine (A) pairs specifically with thymine (T) in DNA and with uracil (U) in RNA. Quantitatively, each GC base pair is held together by three hydrogen bonds, while AT and AU base pairs are held together by two hydrogen bonds. To emphasize this difference, the base pairings are often represented as "G≡C" versus "A=T" or "A=U".

DNA with low GC-content is less stable than DNA with high GC-content; however, the hydrogen bonds themselves do not have a particularly significant impact on molecular stability, which is instead caused mainly by molecular interactions of base stacking. [2] In spite of the higher thermostability conferred to a nucleic acid with high GC-content, it has been observed that at least some species of bacteria with DNA of high GC-content undergo autolysis more readily, thereby reducing the longevity of the cell per se. [3] Because of the thermostability of GC pairs, it was once presumed that high GC-content was a necessary adaptation to high temperatures, but this hypothesis was refuted in 2001. [4] Even so, it has been shown that there is a strong correlation between the optimal growth of prokaryotes at higher temperatures and the GC-content of structural RNAs such as ribosomal RNA, transfer RNA, and many other non-coding RNAs. [4] [5] The AU base pairs are less stable than the GC base pairs, making high-GC-content RNA structures more resistant to the effects of high temperatures.

More recently, it has been demonstrated that the most important factor contributing to the thermal stability of double-stranded nucleic acids is the stacking of adjacent bases rather than the number of hydrogen bonds between the bases. The stacking energy is more favorable for GC pairs than for AT or AU pairs because of the relative positions of exocyclic groups. Additionally, there is a correlation between the order in which the bases stack and the thermal stability of the molecule as a whole. [6]

Determination

Schematic karyogram of a human, showing an overview of the human genome on G banding (which includes Giemsa-staining), wherein GC rich regions are lighter and GC poor regions are darker.

Further information: Karyotype

GC-content is usually expressed as a percentage value, but sometimes as a ratio (called G+C ratio or GC-ratio). GC-content percentage is calculated as [7]

$$\frac{G + C}{A + T + G + C} \times 100\%$$

whereas the AT/GC ratio is calculated as [8]

$$\frac{A + T}{G + C}.$$

The GC-content percentage and the GC-ratio can be measured by several means, but one of the simplest is to measure the melting temperature of the DNA double helix using spectrophotometry. The absorbance of DNA at a wavelength of 260 nm increases fairly sharply when the double-stranded molecule is heated enough to separate into two single strands. [9] When large numbers of samples must be processed, the most commonly used protocol for determining GC-ratios uses flow cytometry. [10]
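The link between melting temperature and GC-content can be illustrated with a small sketch. It assumes the classical Marmur-Doty empirical relation, Tm = 69.3 + 0.41 × (%GC), which is not given in this article and applies only to long genomic DNA melted in standard saline citrate buffer; the function names are hypothetical and the result is a rough estimate, not a substitute for a calibrated measurement.

```python
# Assumed empirical constants (Marmur-Doty relation, standard saline citrate).
# They are NOT taken from this article and do not hold for other buffers
# or for short oligonucleotides.
INTERCEPT_C = 69.3
SLOPE_C_PER_PERCENT = 0.41


def tm_from_gc_percent(gc_percent: float) -> float:
    """Predicted melting temperature (deg C) for a given GC percentage."""
    return INTERCEPT_C + SLOPE_C_PER_PERCENT * gc_percent


def gc_percent_from_tm(tm_celsius: float) -> float:
    """Estimated GC percentage from a measured melting temperature (deg C)."""
    return (tm_celsius - INTERCEPT_C) / SLOPE_C_PER_PERCENT
```

Under these assumptions, a higher measured melting temperature maps linearly to a higher estimated GC percentage.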

Alternatively, if the DNA or RNA molecule under investigation has been reliably sequenced, GC-content can be calculated accurately by simple arithmetic or with any of a variety of publicly available software tools, such as free online GC calculators.
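For a sequenced molecule, the arithmetic amounts to counting bases. The following is a minimal sketch in Python; the function names are hypothetical, and the choice to ignore ambiguous symbols such as N is an assumption rather than a fixed convention.

```python
from collections import Counter


def base_counts(seq: str) -> Counter:
    """Count A, C, G, T and U, ignoring case and any other symbol (e.g. N)."""
    counts = Counter(seq.upper())
    return Counter({base: counts.get(base, 0) for base in "ACGTU"})


def gc_content(seq: str) -> float:
    """GC-content as a percentage: (G + C) / (A + T + G + C) * 100.

    U is counted in place of T so that RNA sequences also work.
    """
    c = base_counts(seq)
    total = sum(c.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G, T or U")
    return 100.0 * (c["G"] + c["C"]) / total


def at_gc_ratio(seq: str) -> float:
    """AT/GC ratio: (A + T) / (G + C), with U counted in place of T."""
    c = base_counts(seq)
    gc = c["G"] + c["C"]
    if gc == 0:
        raise ValueError("sequence contains no G or C")
    return (c["A"] + c["T"] + c["U"]) / gc


print(gc_content("AGCTATAG"))  # 3 of 8 bases are G or C -> 37.5
```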

Genomic content

Within-genome variation

The GC-ratio within a genome is found to be markedly variable. These variations in GC-ratio within the genomes of more complex organisms result in a mosaic-like formation with islet regions called isochores. [11] This results in the variations in staining intensity in chromosomes. [12] GC-rich isochores typically include many protein-coding genes within them, and thus determination of GC-ratios of these specific regions contributes to mapping gene-rich regions of the genome. [13] [14]

Coding sequences

Within a long region of genomic sequence, genes are often characterised by a higher GC-content than the background GC-content of the entire genome. [15] There is evidence that the length of the coding region of a gene is positively correlated with its G+C content. [16] This has been attributed to the fact that stop codons are biased towards A and T nucleotides, so that the shorter the sequence, the higher the AT bias. [17]

Comparison of more than 1,000 orthologous genes in mammals showed marked within-genome variations of the third-codon position GC content, with a range from less than 30% to more than 80%. [18]
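Third-codon-position GC-content (often abbreviated GC3) can be computed directly from a coding sequence. The sketch below uses a hypothetical function name and assumes the input starts in frame at the first base of a codon; it is an illustration, not the method used in the cited study.

```python
def gc3_content(cds: str) -> float:
    """Percentage of third-codon positions that are G or C.

    Assumes `cds` begins at the first base of a codon. A trailing partial
    codon is ignored. Ambiguous symbols (e.g. N) at a third position are
    counted as non-GC.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3   # drop any incomplete final codon
    third = cds[2:usable:3]            # every third base: indices 2, 5, 8, ...
    if not third:
        raise ValueError("sequence is shorter than one codon")
    return 100.0 * sum(base in "GC" for base in third) / len(third)


# Codons AAG, AAC, AAT, AAA have third bases G, C, T, A.
print(gc3_content("AAGAACAATAAA"))  # 50.0
```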

Among-genome variation

GC-content varies between organisms; this variation is thought to result from differences in selection, mutational bias, and biased recombination-associated DNA repair. [19]

The GC-content of the human genome ranges from 35% to 60% across 100-kb fragments, with a mean of 41%. [20] The GC-content of yeast (Saccharomyces cerevisiae) is 38%, [21] and that of another common model organism, thale cress (Arabidopsis thaliana), is 36%. [22] Because of the nature of the genetic code, it is virtually impossible for an organism to have a genome with a GC-content approaching either 0% or 100%. However, Plasmodium falciparum is a species with an extremely low GC-content (about 20%), [23] and such genomes are commonly described as AT-rich rather than GC-poor. [24]
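Per-fragment figures such as those quoted for 100-kb fragments come from computing GC-content window by window along a sequence. The following is a minimal sketch with a hypothetical function name; the use of non-overlapping windows and the dropping of a short final window are assumptions, not the procedure of the cited consortium.

```python
from typing import List


def windowed_gc(seq: str, window: int = 100_000) -> List[float]:
    """GC percentage of consecutive, non-overlapping windows of `window` bases.

    A final stretch shorter than `window` is dropped. Any symbol other than
    G or C (including N) counts towards the window length but not the GC total.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    seq = seq.upper()
    values = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        values.append(100.0 * gc / window)
    return values


# Toy example with 4-base windows; the trailing "TT" is dropped.
print(windowed_gc("GGGGAAAACCGGTT", window=4))  # [100.0, 0.0, 100.0]
```

The minimum, maximum and mean of the returned list then summarise how GC-content varies along the sequence.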

Several mammalian species (e.g., shrew, microbat, tenrec, rabbit) have independently undergone a marked increase in the GC-content of their genes. These GC-content changes are correlated with species life-history traits (e.g., body mass or longevity) and genome size, [18] and might be linked to a molecular phenomenon called GC-biased gene conversion. [25]

Applications

Molecular biology

In polymerase chain reaction (PCR) experiments, the GC-content of short oligonucleotides known as primers is often used to predict their annealing temperature to the template DNA. A higher GC-content level indicates a relatively higher melting temperature.
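One common rule of thumb that makes this dependence explicit is the Wallace rule, Tm ≈ 2 °C × (A + T) + 4 °C × (G + C). It is not taken from this article, is intended only for short oligonucleotides of roughly 14 to 20 bases, and ignores salt and primer concentration. The sketch below uses a hypothetical function name and should be read as a rough estimate only.

```python
def wallace_tm(primer: str) -> int:
    """Approximate melting temperature (deg C) of a short DNA primer.

    Uses the Wallace rule of thumb: 2 degrees per A or T and 4 degrees per
    G or C. Rejects sequences containing anything other than A, C, G, T.
    """
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    if not primer or at + gc != len(primer):
        raise ValueError("primer must be a non-empty string of A, C, G, T")
    return 2 * at + 4 * gc


# Two 18-base primers: the GC-only one has twice the estimated Tm.
print(wallace_tm("ATATATATATATATATAT"))  # 36
print(wallace_tm("GCGCGCGCGCGCGCGCGC"))  # 72
```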

Many sequencing technologies, such as Illumina sequencing, have trouble reading high-GC-content sequences. Bird genomes are known to contain many such regions, which gave rise to the problem of "missing genes": genes expected to be present on evolutionary and phenotypic grounds but never sequenced until improved methods were used. [26]

Systematics

The species problem in non-eukaryotic taxonomy has led to various suggestions for classifying bacteria, and in 1987 the ad hoc committee on reconciliation of approaches to bacterial systematics recommended the use of GC-ratios in higher-level hierarchical classification. [27] For example, the Actinomycetota are characterised as "high GC-content bacteria". [28] In Streptomyces coelicolor A3(2), GC-content is 72%. [29] With the use of more reliable, modern methods of molecular systematics, the GC-content definition of Actinomycetota has been abolished and low-GC bacteria of this clade have been found. [30]

Software tools

GCSpeciesSorter [31] and TopSort [32] are software tools for classifying species based on their GC-contents.

References

  1. Definition of GC-content on CancerWeb of Newcastle University, UK
  2. Yakovchuk P, Protozanova E, Frank-Kamenetskii MD (2006). "Base-stacking and base-pairing contributions into thermal stability of the DNA double helix". Nucleic Acids Res. 34 (2): 564–74. doi:10.1093/nar/gkj454. PMC   1360284 . PMID   16449200.
  3. Levin RE, Van Sickle C (1976). "Autolysis of high-GC isolates of Pseudomonas putrefaciens". Antonie van Leeuwenhoek. 42 (1–2): 145–55. doi:10.1007/BF00399459. PMID   7999. S2CID   9960732.
  4. Hurst LD, Merchant AR (March 2001). "High guanine-cytosine content is not an adaptation to high temperature: a comparative analysis amongst prokaryotes". Proc. Biol. Sci. 268 (1466): 493–7. doi:10.1098/rspb.2000.1397. PMC   1088632 . PMID   11296861.
  5. Galtier, N.; Lobry, J.R. (1997). "Relationships between genomic G+C content, RNA secondary structures, and optimal growth temperature in Prokaryotes". Journal of Molecular Evolution. 44 (6): 632–636. Bibcode:1997JMolE..44..632G. doi:10.1007/PL00006186. PMID   9169555. S2CID   19054315.
  6. Yakovchuk, Peter; Protozanova, Ekaterina; Frank-Kamenetskii, Maxim D. (2006). "Base-stacking and base-pairing contributions into thermal stability of the DNA double helix". Nucleic Acids Research. 34 (2): 564–574. doi:10.1093/nar/gkj454. ISSN   0305-1048. PMC   1360284 . PMID   16449200.
  7. Madigan MT, Martinko JM (2003). Brock biology of microorganisms (10th ed.). Pearson-Prentice Hall. ISBN   978-84-205-3679-8.
  8. "Definition of GC-ratio on Northwestern University, IL, USA". Archived from the original on 20 June 2010. Retrieved 11 June 2007.
  9. Wilhelm J, Pingoud A, Hahn M (May 2003). "Real-time PCR-based method for the estimation of genome sizes". Nucleic Acids Res. 31 (10): e56. doi:10.1093/nar/gng056. PMC   156059 . PMID   12736322.
  10. Vinogradov AE (May 1994). "Measurement by flow cytometry of genomic AT/GC ratio and genome size". Cytometry. 16 (1): 34–40. doi: 10.1002/cyto.990160106 . PMID   7518377.
  11. Bernardi G (January 2000). "Isochores and the evolutionary genomics of vertebrates". Gene. 241 (1): 3–17. doi:10.1016/S0378-1119(99)00485-0. PMID   10607893.
  12. Furey TS, Haussler D (May 2003). "Integration of the cytogenetic map with the draft human genome sequence". Hum. Mol. Genet. 12 (9): 1037–44. doi: 10.1093/hmg/ddg113 . PMID   12700172.
  13. Sumner AT, de la Torre J, Stuppia L (August 1993). "The distribution of genes on chromosomes: a cytological approach". J. Mol. Evol. 37 (2): 117–22. Bibcode:1993JMolE..37..117S. doi:10.1007/BF02407346. PMID   8411200. S2CID   24677431.
  14. Aïssani B, Bernardi G (October 1991). "CpG islands, genes and isochores in the genomes of vertebrates". Gene. 106 (2): 185–95. doi:10.1016/0378-1119(91)90198-K. PMID   1937049.
  15. Romiguier J, Roux C (2017). "Analytical Biases Associated with GC-Content in Molecular Evolution". Front Genet. 8: 16. doi: 10.3389/fgene.2017.00016 . PMC   5309256 . PMID   28261263.
  16. Pozzoli U, Menozzi G, Fumagalli M, et al. (2008). "Both selective and neutral processes drive GC content evolution in the human genome". BMC Evol. Biol. 8 (1): 99. Bibcode:2008BMCEE...8...99P. doi: 10.1186/1471-2148-8-99 . PMC   2292697 . PMID   18371205.
  17. Wuitschick JD, Karrer KM (1999). "Analysis of genomic G + C content, codon usage, initiator codon context and translation termination sites in Tetrahymena thermophila". J. Eukaryot. Microbiol. 46 (3): 239–47. doi:10.1111/j.1550-7408.1999.tb05120.x. PMID   10377985. S2CID   28836138.
  18. Romiguier, Jonathan; Ranwez, Vincent; Douzery, Emmanuel J. P.; Galtier, Nicolas (1 August 2010). "Contrasting GC-content dynamics across 33 mammalian genomes: Relationship with life-history traits and chromosome sizes". Genome Research. 20 (8): 1001–1009. doi:10.1101/gr.104372.109. ISSN   1088-9051. PMC   2909565 . PMID   20530252.
  19. Birdsell JA (1 July 2002). "Integrating genomics, bioinformatics, and classical genetics to study the effects of recombination on genome evolution". Mol. Biol. Evol. 19 (7): 1181–97. CiteSeerX   10.1.1.337.1535 . doi:10.1093/oxfordjournals.molbev.a004176. PMID   12082137.
  20. International Human Genome Sequencing Consortium (February 2001). "Initial sequencing and analysis of the human genome". Nature. 409 (6822): 860–921. Bibcode:2001Natur.409..860L. doi: 10.1038/35057062 . hdl: 2027.42/62798 . PMID   11237011. (page 876)
  21. Whole genome data of Saccharomyces cerevisiae on NCBI
  22. Whole genome data of Arabidopsis thaliana on NCBI
  23. Whole genome data of Plasmodium falciparum on NCBI
  24. Musto H, Cacciò S, Rodríguez-Maseda H, Bernardi G (1997). "Compositional constraints in the extremely GC-poor genome of Plasmodium falciparum" (PDF). Mem. Inst. Oswaldo Cruz. 92 (6): 835–41. doi: 10.1590/S0074-02761997000600020 . PMID   9566216.
  25. Duret L, Galtier N (2009). "Biased gene conversion and the evolution of mammalian genomic landscapes". Annu Rev Genom Hum Genet. 10: 285–311. doi:10.1146/annurev-genom-082908-150001. PMID   19630562. S2CID   9126286.
  26. Huttener R, Thorrez L, Veld TI, et al. (2021). "Sequencing refractory regions in bird genomes are hotspots for accelerated protein evolution". BMC Ecol Evol. 21 (176): 176. doi: 10.1186/s12862-021-01905-7 . PMC   8449477 . PMID   34537008.
  27. Wayne LG; et al. (1987). "Report of the ad hoc committee on reconciliation of approaches to bacterial systematic". International Journal of Systematic Bacteriology. 37 (4): 463–4. doi: 10.1099/00207713-37-4-463 .
  28. Taxonomy browser on NCBI
  29. Whole genome data of Streptomyces coelicolor A3(2) on NCBI
  30. Ghai R, McMahon KD, Rodriguez-Valera F (2012). "Breaking a paradigm: Cosmopolitan and abundant freshwater actinobacteria are low GC". Environmental Microbiology Reports. 4 (1): 29–35. Bibcode:2012EnvMR...4...29G. doi:10.1111/j.1758-2229.2011.00274.x. PMID   23757226.
  31. Karimi K, Wuitchik D, Oldach M, Vize P (2018). "Distinguishing Species Using GC Contents in Mixed DNA or RNA Sequences". Evol Bioinform Online. 14 (January 1, 2018): 1176934318788866. doi:10.1177/1176934318788866. PMC   6052495 . PMID   30038485.
  32. Lehnert E, Mouchka M, Burriesci M, Gallo N, Schwarz J, Pringle J (2014). "Extensive differences in gene expression between symbiotic and aposymbiotic cnidarians". G3 (Bethesda). 4 (2): 277–95. doi:10.1534/g3.113.009084. PMC   3931562 . PMID   24368779.

External links

  1. Table with GC-content of all sequenced prokaryotes
  2. Taxonomic browser of bacteria based on GC ratio on NCBI website.
  3. GC ratio in diverse species.