Stockholm format

Filename extensions: .sto, .stk
Internet media type: text/x-stockholm-alignment
Developed by: Erik Sonnhammer
Type of format: Bioinformatics
Open format?: yes
Website: sonnhammer.sbc.su.se/Stockholm.html

Stockholm format is a multiple sequence alignment format used by Pfam, Rfam and Dfam to disseminate protein, RNA and DNA sequence alignments. [1] [2] [3] The alignment editors Ralee, [4] Belvu and Jalview support Stockholm format, as do the probabilistic database search tools Infernal and HMMER and the phylogenetic analysis tool Xrate. Stockholm format files often have the filename extension .sto or .stk. [5]

Syntax

A well-formed Stockholm file always begins with a header line that states the format and version identifier, currently '# STOCKHOLM 1.0'. The header is followed by multiple lines, a mix of markup (starting with #) and sequences. Finally, a "//" line indicates the end of the alignment.

An example without markup looks like:

# STOCKHOLM 1.0
#=GF ID   EXAMPLE
<seqname> <aligned sequence>
<seqname> <aligned sequence>
<seqname> <aligned sequence>
//

Sequences are written one per line. The sequence name comes first, followed by any amount of whitespace and then the aligned sequence. Sequence names are typically in the form "name/start-end" or just "name". Sequence letters may include any characters except whitespace. Gaps may be indicated by "." or "-".
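The rules above can be sketched as a small reader. This is a minimal illustration, not a reference parser: the function name `read_sequences` is hypothetical, and it assumes a single alignment in which each sequence occupies exactly one line.

```python
def read_sequences(text):
    """Return {seqname: aligned sequence} for one Stockholm alignment.

    Assumes each sequence appears on a single line (hypothetical helper).
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "# STOCKHOLM 1.0":
        raise ValueError("missing '# STOCKHOLM 1.0' header")
    sequences = {}
    for line in lines[1:]:
        line = line.strip()
        if line == "//":                 # end of the alignment
            break
        if not line or line.startswith("#"):
            continue                     # blank line or markup line
        parts = line.split(None, 1)      # name, then the aligned sequence
        if len(parts) != 2:
            raise ValueError(f"sequence line without residues: {line!r}")
        name, residues = parts
        sequences[name] = residues
    return sequences
```

Calling `read_sequences` on the text of a file returns the aligned rows keyed by name, with gap characters left in place.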

Mark-up lines start with #. The "parameters" are separated by whitespace, so an underscore ("_") instead of space should be used for the 1-char-per-column markups. Mark-up types defined include:

#=GF <feature> <Generic per-File annotation, free text>
#=GC <feature> <Generic per-Column annotation, exactly 1 char per column>
#=GS <seqname> <feature> <Generic per-Sequence annotation, free text>
#=GR <seqname> <feature> <Generic per-Residue annotation, exactly 1 char per residue>
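As a rough sketch of how the four markup types differ in shape, the following splits one markup line into its parts. The function name `parse_markup` and the returned tuple layout are illustrative assumptions, not part of the format.

```python
def parse_markup(line):
    """Split one '#=G?' line into (kind, seqname, feature, value).

    seqname is None for the per-file (#=GF) and per-column (#=GC) types.
    """
    kind = line[:4]
    if kind in ("#=GF", "#=GC"):
        # '#=GF <feature> <text>' or '#=GC <feature> <markup>'
        _, feature, value = line.split(None, 2)
        return kind, None, feature, value.strip()
    if kind in ("#=GS", "#=GR"):
        # '#=GS <seqname> <feature> <text>' or '#=GR <seqname> <feature> <markup>'
        _, seqname, feature, value = line.split(None, 3)
        return kind, seqname, feature, value.strip()
    raise ValueError(f"not a markup line: {line!r}")
```

Limiting the number of splits keeps free-text values, which may contain spaces, in one piece.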

These feature names are used by Pfam and Rfam for specific types of annotation (see the Pfam and Rfam documentation under "Description of fields").

#=GF

Pfam and Rfam may use the following tags:

   Compulsory fields:
   ------------------
   AC   Accession number:           Accession number in form PFxxxxx (Pfam) or RFxxxxx (Rfam).
   ID   Identification:             One word name for family.
   DE   Definition:                 Short description of family.
   AU   Author:                     Authors of the entry.
   SE   Source of seed:             The source suggesting the seed members belong to one family.
   SS   Source of structure:        The source (prediction or publication) of the consensus RNA secondary structure used by Rfam.
   BM   Build method:               Command line used to generate the model
   SM   Search method:              Command line used to perform the search
   GA   Gathering threshold:        Search threshold to build the full alignment.
   TC   Trusted Cutoff:             Lowest sequence score (and domain score for Pfam) of match in the full alignment.
   NC   Noise Cutoff:               Highest sequence score (and domain score for Pfam) of match not in full alignment.
   TP   Type:                       Type of family -- presently Family, Domain, Motif or Repeat for Pfam.
                                                   -- a tree with roots Gene, Intron or Cis-reg for Rfam.
   SQ   Sequence:                   Number of sequences in alignment.

   Optional fields:
   ----------------
   DC   Database Comment:           Comment about database reference.
   DR   Database Reference:         Reference to external database.
   RC   Reference Comment:          Comment about literature reference.
   RN   Reference Number:           Reference Number.
   RM   Reference Medline:          Eight digit medline UI number.
   RT   Reference Title:            Reference Title.
   RA   Reference Author:           Reference Author
   RL   Reference Location:         Journal location.
   PI   Previous identifier:        Record of all previous ID lines.
   KW   Keywords:                   Keywords.
   CC   Comment:                    Comments.
   NE   Pfam accession:             Indicates a nested domain.
   NL   Location:                   Location of nested domains - sequence ID, start and end of insert.
   WK   Wikipedia link:             Wikipedia page
   CL   Clan:                       Clan accession
   MB   Membership:                 Used for listing Clan membership

   For embedding trees:
   --------------------
   NH   New Hampshire               A tree in New Hampshire eXtended format.
   TN   Tree ID                     A unique identifier for the next tree.

   Other:
   ------
   FR   False discovery Rate:       A method used to set the bit score threshold based on the ratio of
                                    expected false positives to true positives. Floating point number between 0 and 1.
   CB   Calibration method:         Command line used to calibrate the model (Rfam only, release 12.0 and later)
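Free-text tags such as CC or RT are often spread over several consecutive #=GF lines. A minimal sketch of gathering them is shown here; the helper name `collect_gf` is hypothetical, and joining every repeated tag with a space is a simplifying assumption (an entry with several numbered references would need the RN/RM/RT/RA/RL lines grouped per reference instead).

```python
from collections import defaultdict


def collect_gf(lines):
    """Map each #=GF tag to its text, joining repeated tags with a space."""
    fields = defaultdict(list)
    for line in lines:
        if not line.startswith("#=GF"):
            continue                      # only per-file annotation is collected
        parts = line.split(None, 2)       # '#=GF', tag, free text
        if len(parts) == 3:
            fields[parts[1]].append(parts[2].strip())
    return {tag: " ".join(values) for tag, values in fields.items()}
```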

#=GS

Rfam and Pfam may use these features:

      Feature                    Description
      ---------------------      -----------
      AC <accession>             ACcession number
      DE <freetext>              DEscription
      DR <db>; <accession>;      Database Reference
      OS <organism>              Organism (species)
      OC <clade>                 Organism Classification (clade, etc.)
      LO <look>                  Look (Color, etc.)

#=GR

      Feature   Description            Markup letters
      -------   -----------            --------------
      SS        Secondary Structure    For RNA [.,;<>(){}[]AaBb.-_] --supports pseudoknot and further structure markup (see WUSS documentation)
                                       For protein [HGIEBTSCX]
      SA        Surface Accessibility  [0-9X]                      (0=0%-10%; ...; 9=90%-100%)
      TM        TransMembrane          [Mio]
      PP        Posterior Probability  [0-9*]                      (0=0.00-0.05; 1=0.05-0.15; *=0.95-1.00)
      LI        LIgand binding         [*]
      AS        Active Site            [*]
     pAS        AS - Pfam predicted    [*]
     sAS        AS - from SwissProt    [*]
      IN        INtron (in or after)   [0-2]

      For RNA tertiary interactions:
      ------------------------------
      tWW       WC/WC        in trans   For basepairs: [<>AaBb...Zz]  For unpaired: [.]
      cWH       WC/Hoogsteen in cis
      cWS       WC/SugarEdge in cis
      tWS       WC/SugarEdge in trans
      notes: (1) {c,t}{W,H,S}{W,H,S} for general format.
             (2) cWW is equivalent to SS.
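To illustrate how a one-character-per-residue code is read, the sketch below expands an SA (surface accessibility) string into percentage ranges using the 0=0%-10% through 9=90%-100% scale listed above. The function name is hypothetical, and treating "X" as "no value" is an assumption, since the table does not define it.

```python
def accessibility_ranges(sa):
    """Expand an SA markup string into (low %, high %) tuples, one per residue."""
    ranges = []
    for ch in sa:
        if ch in "0123456789":
            low = int(ch) * 10
            ranges.append((low, low + 10))   # e.g. '9' covers 90%-100%
        elif ch == "X":
            ranges.append(None)              # assumed to mean "unknown"
        else:
            raise ValueError(f"unexpected SA character: {ch!r}")
    return ranges
```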

#=GC

The list of valid features includes those shown below, as well as the same features as for #=GR with "_cons" appended, meaning "consensus". Example: "SS_cons".

      Feature   Description            Details
      -------   -----------            -------
      RF        ReFerence annotation   Often the consensus RNA or protein sequence is used as a reference
                                       Any non-gap character (e.g. x's) can indicate consensus/conserved/match columns
                                       .'s or -'s indicate insert columns
                                       ~'s indicate unaligned insertions
                                       Upper and lower case can be used to discriminate strong and weakly conserved
                                       residues respectively
      MM        Model Mask             Indicates which columns in an alignment should be masked, such
                                       that the emission probabilities for match states corresponding to
                                       those columns will be the background distribution.
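The RF convention can be illustrated with a short sketch that finds the match columns and projects an aligned row onto them. Both function names are hypothetical; the only rule applied is the one stated above, that ".", "-" and "~" mark non-match columns and any other character marks a match column.

```python
NON_MATCH = ".-~"   # insert columns and unaligned insertions


def match_columns(rf):
    """Return the 0-based indices of match columns in a #=GC RF line."""
    return [i for i, ch in enumerate(rf) if ch not in NON_MATCH]


def keep_match_columns(aligned, rf):
    """Return only the characters of an aligned row that fall in match columns."""
    if len(aligned) != len(rf):
        raise ValueError("row and RF line must have the same width")
    return "".join(aligned[i] for i in match_columns(rf))
```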

Notes

Size limits

There are no explicit size limits on any field. However, a simple parser that uses fixed field sizes should work safely on Pfam and Rfam alignments with these limits:

Examples

A simple example of an Rfam alignment (UPSK RNA) with a pseudoknot in Stockholm format is shown below: [6]

# STOCKHOLM 1.0
#=GF ID    UPSK
#=GF SE    Predicted; Infernal
#=GF SS    Published; PMID 9223489
#=GF RN    [1]
#=GF RM    9223489
#=GF RT    The role of the pseudoknot at the 3' end of turnip yellow mosaic
#=GF RT    virus RNA in minus-strand synthesis by the viral RNA-dependent RNA
#=GF RT    polymerase.
#=GF RA    Deiman BA, Kortlever RM, Pleij CW;
#=GF RL    J Virol 1997;71:5990-5996.
AF035635.1/619-641             UGAGUUCUCGAUCUCUAAAAUCG
M24804.1/82-104                UGAGUUCUCUAUCUCUAAAAUCG
J04373.1/6212-6234             UAAGUUCUCGAUCUUUAAAAUCG
M24803.1/1-23                  UAAGUUCUCGAUCUCUAAAAUCG
#=GC SS_cons                   .AAA....<<<<aaa....>>>>
//
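The SS_cons line of this example can be turned into a list of paired columns. The sketch below is one possible reading of the notation, not an official WUSS parser: the function name `base_pairs` is hypothetical, brackets are matched by nesting, an uppercase letter is taken to open a pseudoknot pair that the same lowercase letter closes, and every other character is treated as unpaired.

```python
OPEN_TO_CLOSE = {"<": ">", "(": ")", "[": "]", "{": "}"}
CLOSE_TO_OPEN = {close: opener for opener, close in OPEN_TO_CLOSE.items()}


def base_pairs(ss):
    """Return sorted (i, j) pairs of 0-based columns paired in a structure line."""
    stacks = {}   # one stack of open positions per bracket type or letter
    pairs = []
    for i, ch in enumerate(ss):
        if ch in OPEN_TO_CLOSE or "A" <= ch <= "Z":
            stacks.setdefault(ch, []).append(i)
        elif ch in CLOSE_TO_OPEN or "a" <= ch <= "z":
            opener = CLOSE_TO_OPEN.get(ch, ch.upper())
            if not stacks.get(opener):
                raise ValueError(f"unmatched {ch!r} at column {i}")
            pairs.append((stacks[opener].pop(), i))
        # any other character ('.', ',', ';', '_', '-', ...) is unpaired
    if any(stacks.values()):
        raise ValueError("unclosed pairing symbol")
    return sorted(pairs)
```

Applied to `.AAA....<<<<aaa....>>>>`, this yields three pseudoknot pairs from the `AAA`/`aaa` columns and four nested pairs from the `<<<<`/`>>>>` stem.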

Here is a slightly more complex example showing the Pfam CBS domain:

# STOCKHOLM 1.0
#=GF ID CBS
#=GF AC PF00571
#=GF DE CBS domain
#=GF AU Bateman A
#=GF CC CBS domains are small intracellular modules mostly found
#=GF CC in 2 or four copies within a protein.
#=GF SQ 5
#=GS O31698/18-71 AC O31698
#=GS O83071/192-246 AC O83071
#=GS O83071/259-312 AC O83071
#=GS O31698/88-139 AC O31698
#=GS O31698/88-139 OS Bacillus subtilis
O83071/192-246          MTCRAQLIAVPRASSLAEAIACAQKMRVSRVPVYERS
#=GR O83071/192-246 SA  9998877564535242525515252536463774777
O83071/259-312          MQHVSAPVFVFECTRLAYVQHKLRAHSRAVAIVLDEY
#=GR O83071/259-312 SS  CCCCCHHHHHHHHHHHHHEEEEEEEEEEEEEEEEEEE
O31698/18-71            MIEADKVAHVQVGNNLEHALLVLTKTGYTAIPVLDPS
#=GR O31698/18-71 SS    CCCHHHHHHHHHHHHHHHEEEEEEEEEEEEEEEEHHH
O31698/88-139           EVMLTDIPRLHINDPIMKGFGMVINN..GFVCVENDE
#=GR O31698/88-139 SS   CCCCCCCHHHHHHHHHHHHEEEEEEEEEEEEEEEEEH
#=GC SS_cons            CCCCCHHHHHHHHHHHHHEEEEEEEEEEEEEEEEEEH
O31699/88-139           EVMLTDIPRLHINDPIMKGFGMVINN..GFVCVENDE
#=GR O31699/88-139 AS   ________________*____________________
#=GR O31699/88-139 IN   ____________1____________2______0____
//
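Because #=GC and #=GR markup carries exactly one character per column, a file like the one above can be sanity-checked by comparing row widths. The following is a minimal sketch under the assumption that every sequence and markup row sits on a single line; the function name `check_widths` and the shape of its return value are illustrative.

```python
def check_widths(text):
    """Return (line number, label) for rows whose width differs from the first row."""
    width = None
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line == "//":
            continue
        if line.startswith("#=GC"):
            expected = 3                 # '#=GC', feature, markup
        elif line.startswith("#=GR"):
            expected = 4                 # '#=GR', seqname, feature, markup
        elif line.startswith("#"):
            continue                     # header, #=GF and #=GS hold free text
        else:
            expected = 2                 # seqname, residues
        fields = line.split(None, expected - 1)
        label = " ".join(fields[:-1]) if len(fields) == expected else line
        if len(fields) != expected:
            problems.append((number, label))
            continue
        if width is None:
            width = len(fields[-1])      # first row sets the alignment width
        elif len(fields[-1]) != width:
            problems.append((number, label))
    return problems
```

An empty list means every sequence, per-residue and per-column row has the same width.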

References

  1. Gardner PP, Daub J, Tate JG, Nawrocki EP, Kolbe DL, Lindgreen S, et al. (January 2009). "Rfam: updates to the RNA families database". Nucleic Acids Research. 37 (Database issue): D136–D140. doi:10.1093/nar/gkn766. PMC 2686503. PMID 18953034.
  2. Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, et al. (January 2008). "The Pfam protein families database". Nucleic Acids Research. 36 (Database issue): D281–D288. doi:10.1093/nar/gkm960. PMC 2238907. PMID 18039703.
  3. Storer J, Hubley R, Rosen J, Wheeler TJ, Smit AF (January 2021). "The Dfam community resource of transposable element families, sequence models, and genome annotations". Mobile DNA. 12 (1): 2. doi:10.1186/s13100-020-00230-y. PMC 7805219. PMID 33436076.
  4. Griffiths-Jones S (January 2005). "RALEE--RNA ALignment editor in Emacs". Bioinformatics. 21 (2): 257–259. doi:10.1093/bioinformatics/bth489. PMID 15377506.
  5. "Alignment Fileformats". 22 May 2019. Retrieved 22 May 2019.
  6. Deiman BA, Kortlever RM, Pleij CW (August 1997). "The role of the pseudoknot at the 3' end of turnip yellow mosaic virus RNA in minus-strand synthesis by the viral RNA-dependent RNA polymerase". Journal of Virology. 71 (8): 5990–5996. doi:10.1128/JVI.71.8.5990-5996.1997. PMC 191855. PMID 9223489.