SAMtools

Original author(s) Heng Li
Developer(s) John Marshall and Petr Danecek et al. [1]
Initial release 2009
Stable release 1.17 / February 21, 2023 [2]
Written in C
Operating system Unix-like
Type Bioinformatics
License BSD, MIT
Website www.htslib.org

SAMtools is a set of utilities for interacting with and post-processing short DNA sequence read alignments in the SAM (Sequence Alignment/Map), BAM (Binary Alignment/Map) and CRAM formats, written by Heng Li. These files are generated as output by short read aligners like BWA. Both simple and advanced tools are provided, supporting complex tasks like variant calling and alignment viewing as well as sorting, indexing, data extraction and format conversion. [3]

SAM files are human-readable text files and can be very large (tens of gigabytes is common), so compression is used to save space. BAM files are their compressed binary equivalent, whilst CRAM files use a restructured, column-oriented binary container format. BAM files are more efficient for software to work with than SAM, and SAMtools makes it possible to work directly with a compressed BAM file without having to uncompress the whole file. Additionally, since the SAM/BAM format is somewhat complex (containing reads, references, alignments, quality information, and user-specified annotations), SAMtools reduces the effort needed to use SAM/BAM files by hiding low-level details.
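To make the structure of a SAM record concrete, the following minimal Python sketch splits one tab-delimited alignment line into the eleven mandatory SAM fields plus any optional tags. The `SamRecord` type, the `parse_sam_line` function, and the example read are illustrative assumptions and are not part of SAMtools.

```python
from typing import NamedTuple, Tuple


class SamRecord(NamedTuple):
    """The eleven mandatory SAM fields, plus optional tags kept as raw strings."""
    qname: str   # read (query) name
    flag: int    # bitwise FLAG
    rname: str   # reference element name, "*" if unmapped
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string describing the alignment
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # per-base quality string
    tags: Tuple[str, ...]  # user-specified annotations such as "NM:i:0"


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line (not a header line) of SAM text."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tuple(fields[11:]),
    )


# Hypothetical read used only for illustration.
example = "read1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
record = parse_sam_line(example)
print(record.rname, record.pos, record.cigar)
```

BAM stores the same fields in a compressed binary encoding, so a sketch like this applies only to SAM text.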

As third-party projects were trying to use code from SAMtools despite it not being designed to be embedded in that way, the decision was taken in August 2014 to split the SAMtools package into a stand-alone software library with a well-defined API (HTSlib), [4] a project for variant calling and manipulation of variant data (BCFtools), and the stand-alone SAMtools package for working with sequence alignment data. [5]

Usage and commands

Like many Unix commands, SAMtools commands follow a stream model, where data runs through each command as if carried on a conveyor belt. This allows combining multiple commands into a data processing pipeline. Although the final output can be very complex, only a limited number of simple commands are needed to produce it. If no input or output file is specified, the standard streams (stdin, stdout, and stderr) are assumed. Data sent to stdout are printed to the screen by default but are easily redirected to a file using the normal Unix redirectors (> and >>), or to another command via a pipe (|).
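The stream model can be illustrated with a small filter written in the same style: it reads SAM text from one stream, passes header lines through unchanged, and writes only sufficiently well-mapped reads to another stream. This is a sketch under stated assumptions; the function names, the MAPQ threshold of 30, and the demo records are hypothetical and not part of SAMtools.

```python
import io
from typing import Iterable, Iterator, TextIO


def filter_by_mapq(lines: Iterable[str], min_mapq: int) -> Iterator[str]:
    """Yield header lines unchanged and alignment lines with MAPQ >= min_mapq."""
    for line in lines:
        if line.startswith("@"):
            # Header lines start with "@" and are always passed through.
            yield line
            continue
        mapq = int(line.split("\t")[4])  # MAPQ is the fifth SAM field
        if mapq >= min_mapq:
            yield line


def run(instream: TextIO, outstream: TextIO, min_mapq: int = 30) -> None:
    """Act as one stage of a pipeline: read a stream, write a filtered stream."""
    outstream.writelines(filter_by_mapq(instream, min_mapq))


# In-memory streams stand in for stdin and stdout in this demonstration.
demo_in = io.StringIO(
    "@SQ\tSN:chr1\tLN:1000\n"
    "r1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
    "r2\t0\tchr1\t20\t5\t4M\t*\t0\t0\tACGT\tIIII\n"
)
demo_out = io.StringIO()
run(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Calling `run(sys.stdin, sys.stdout)` instead (after `import sys`) would let such a script sit between two other commands in a pipe, for example downstream of a command that emits SAM text with headers.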

SAMtools commands

SAMtools provides the following commands, each invoked as "samtools some_command".

view
The view command filters SAM or BAM formatted data. Options and arguments determine which data are selected (possibly all of it), and only that data is passed through. Input is usually a SAM or BAM file specified as an argument, but can also be SAM or BAM data piped from another command. Possible uses include extracting a subset of data into a new file, converting between BAM and SAM formats, and simply looking at the raw file contents. The order of extracted reads is preserved.
sort
The sort command sorts the reads in a BAM file by their position in the reference, as determined by their alignment. The sort key for each read is the reference element and the coordinate within it to which the read's first matched base aligns. The sorted output is written to standard output by default, or to a file named with the -o option. As sorting is memory intensive and BAM files can be large, the -m option limits the amount of memory used (per thread); when the input does not fit in that memory, sorted blocks are written to temporary files, which are then merged to produce the complete sorted output.
index
The index command creates a new index file that allows fast look-up of data in a coordinate-sorted BAM or CRAM file (or a BGZF-compressed SAM file). Like an index on a database, the generated index file (for example, *.bam.bai for a BAM file) allows programs that can read it to work more efficiently with the data in the associated file.
tview
The tview command starts an interactive ASCII-based viewer that can be used to visualize how reads are aligned to specified small regions of the reference genome. Compared to a graphics-based viewer like IGV, [6] it has few features. Within the view, it is possible to jump to different positions along reference elements (using 'g') and to display help information ('?').
mpileup
The mpileup command produces a pileup format (or BCF) file giving, for each genomic coordinate, the overlapping read bases and indels at that position in the input BAM file(s). This can be used, for example, for SNP calling.
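flagstat
The flagstat command counts the alignments in each category defined by the FLAG field (for example mapped, paired, or duplicate reads) and prints summary statistics.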
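Flag-based summaries of the kind flagstat reports rest on testing individual bits of each record's FLAG field. The sketch below tallies a few of the FLAG bits defined by the SAM format for a list of flag values. The category names, the `count_flags` function, and the sample flags are illustrative assumptions; the output does not reproduce flagstat's actual categories or report layout.

```python
from typing import Dict, Iterable

# A subset of the FLAG bits defined by the SAM format.
FLAG_BITS = {
    "paired": 0x1,
    "unmapped": 0x4,
    "reverse_strand": 0x10,
    "secondary": 0x100,
    "duplicate": 0x400,
    "supplementary": 0x800,
}


def count_flags(flags: Iterable[int]) -> Dict[str, int]:
    """Count how many records have each FLAG bit set, plus the total."""
    counts = {name: 0 for name in FLAG_BITS}
    counts["total"] = 0
    for flag in flags:
        counts["total"] += 1
        for name, bit in FLAG_BITS.items():
            if flag & bit:
                counts[name] += 1
    return counts


# Hypothetical FLAG values, one per alignment record.
summary = count_flags([0, 16, 4, 99, 147, 1024 + 16])
for name, value in summary.items():
    print(f"{name}\t{value}")
```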

Examples

view
samtools view sample.bam > sample.sam

Convert a BAM file into a SAM file.

samtools view -bS sample.sam > sample.bam

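Convert a SAM file into a BAM file. The -b option requests BAM output. The -S option, which older versions required to indicate SAM input, is ignored by current versions because the input format is detected automatically.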

samtools view sample_sorted.bam "chr1:10-13"

Extract all the reads aligned to the specified range, that is, those aligned to the reference element named chr1 that cover its 10th, 11th, 12th or 13th base. The results are printed to standard output in SAM format, without the header. Extracting reads by their mapping position in the reference genome requires an index of the input file, as created by samtools index.
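The selection rule for a region such as chr1:10-13 can be sketched as an interval-overlap test on 1-based, inclusive coordinates. The helper names are hypothetical, and the sketch makes two simplifying assumptions: the reference name contains no colon, and the number of reference bases a read covers is already known (in practice it is derived from the CIGAR string).

```python
import re
from typing import Tuple


def parse_region(region: str) -> Tuple[str, int, int]:
    """Split a region string like "chr1:10-13" into (name, start, end)."""
    match = re.fullmatch(r"([^:]+):([\d,]+)-([\d,]+)", region)
    if match is None:
        raise ValueError(f"unsupported region string: {region!r}")
    name = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    return name, start, end


def overlaps(rname: str, pos: int, aligned_length: int, region: str) -> bool:
    """True if a read starting at pos and covering aligned_length reference
    bases shares at least one base with the region."""
    name, start, end = parse_region(region)
    read_end = pos + aligned_length - 1
    return rname == name and pos <= end and read_end >= start


print(overlaps("chr1", 8, 4, "chr1:10-13"))   # read covers bases 8 to 11
print(overlaps("chr1", 14, 4, "chr1:10-13"))  # read starts after the region
```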

samtools view -h -b sample_sorted.bam "chr1:10-13" > tiny_sorted.bam

Extract the same reads as above, but instead of displaying them, write them to a new BAM file, tiny_sorted.bam. The -b option makes the output compressed BAM and the -h option causes the SAM headers to be output as well. These headers include a description of the reference that the reads in sample_sorted.bam were aligned to and will be needed if the tiny_sorted.bam file is to be used with some of the more advanced SAMtools commands. The order of extracted reads is preserved.
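The description of the reference carried in the headers takes the form of @SQ lines, each giving a reference element name (SN) and length (LN). A minimal sketch of reading them from SAM header text follows; the `reference_lengths` function and the header lines shown are illustrative assumptions.

```python
from typing import Dict, Iterable


def reference_lengths(header_lines: Iterable[str]) -> Dict[str, int]:
    """Collect reference names and lengths from the @SQ lines of a SAM header."""
    refs: Dict[str, int] = {}
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue  # skip @HD, @RG, @PG, @CO and any alignment lines
        fields = line.rstrip("\n").split("\t")[1:]
        tags = dict(field.split(":", 1) for field in fields)
        refs[tags["SN"]] = int(tags["LN"])
    return refs


# Hypothetical header, as would be printed by a command that outputs headers.
header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "@SQ\tSN:chrM\tLN:16569",
]
print(reference_lengths(header))
```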

tview
samtools tview sample_sorted.bam

Start an interactive viewer to visualize a small region of the reference, the reads aligned to it, and mismatches. Within the view, jump to a new location by typing g: and a location, like g:chr1:10,000,000. If the reference element name and following colon are replaced with =, the current reference element is used; for example, if g:=10,000,200 is typed after the previous "goto" command, the viewer jumps to the region 200 base pairs downstream on chr1. Typing ? brings up help information for scroll movement, colors, views, and so on.

samtools tview -p chrM:1 sample_chrM.bam UCSC_hg38.fa

Start the viewer at position chrM:1 (-p) and display the alignments against the reference sequence in UCSC_hg38.fa for comparison.

samtools tview -d T -p chrY:10,000,000 sample_chrY.bam UCSC_hg38.fa >> save.txt
samtools tview -d H -p chrY:10,000,000 sample_chrY.bam UCSC_hg38.fa >> save.html

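Save the screen as text (-d T) or HTML (-d H) instead of starting the interactive viewer; >> appends the output to save.txt or save.html.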

sort
samtools sort -o sorted_out unsorted_in.bam

Read the specified unsorted_in.bam as input, sort it by aligned read position, and write it out to sorted_out. The output format can be SAM, BAM, or CRAM and is determined automatically from the file extension of the output name (for example, sorted_out.bam); if no format can be deduced from the name, BAM is written.
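Sorting by aligned read position can be sketched as ordering records by a two-part key: the rank of the reference element, then the leftmost aligned coordinate. In the sketch below the reference ranks, the sample reads, and the rule of placing unmapped reads (reference "*") last are stated assumptions for illustration, not a description of the internals of samtools sort.

```python
from typing import Dict, Tuple

Read = Tuple[str, int]  # (reference element name, 1-based leftmost position)


def coordinate_key(read: Read, ref_order: Dict[str, int]) -> Tuple[int, int]:
    """Sort key: reference rank first, then position within that reference."""
    rname, pos = read
    if rname == "*":
        # Assumption for this sketch: unmapped reads sort after all mapped reads.
        return (len(ref_order), 0)
    return (ref_order[rname], pos)


# Hypothetical reference ranks, e.g. taken from the order of header @SQ lines.
ref_order = {"chr1": 0, "chr2": 1}
reads = [("chr2", 5), ("*", 0), ("chr1", 300), ("chr1", 20)]

sorted_reads = sorted(reads, key=lambda read: coordinate_key(read, ref_order))
print(sorted_reads)
```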

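samtools sort -m 5000000 unsorted_in.bam sorted_out

This form, with an output prefix as the second argument, is the syntax of older SAMtools releases. Read the specified unsorted_in.bam as input and sort it in blocks that use at most about 5,000,000 bytes of memory each (the -m value is given in bytes unless a K, M, or G suffix is used). Blocks that do not fit in memory are written to temporary BAM files named sorted_out.0000.bam, sorted_out.0001.bam, etc.; each temporary file is sorted on its own, and the files are then merged to produce the final sorted output.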

index
samtools index sorted.bam

Create an index file, sorted.bam.bai, for the sorted.bam file.
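Why an index speeds up look-ups in a sorted file can be sketched with a toy window index: for each fixed-size window of a reference element, it records the first record that overlaps the window, so a query can begin scanning there instead of at the start of the file. The window size, the function names, and the records are assumptions made for this sketch; the real index formats are more elaborate.

```python
from typing import Dict, List, Optional, Tuple

WINDOW = 16384  # window size in bases, chosen for this sketch

Record = Tuple[str, int, int]  # (reference name, 1-based start, 1-based end)


def build_window_index(sorted_records: List[Record]) -> Dict[str, Dict[int, int]]:
    """For each reference and window, store the offset of the first record
    (in coordinate-sorted order) that overlaps the window."""
    index: Dict[str, Dict[int, int]] = {}
    for offset, (rname, start, end) in enumerate(sorted_records):
        windows = index.setdefault(rname, {})
        for w in range((start - 1) // WINDOW, (end - 1) // WINDOW + 1):
            windows.setdefault(w, offset)  # keep the earliest offset only
    return index


def first_candidate(index: Dict[str, Dict[int, int]], rname: str,
                    start: int, end: int) -> Optional[int]:
    """Offset at which to start scanning for records overlapping the query,
    or None if no record overlaps any window touched by the query."""
    windows = index.get(rname, {})
    hits = [windows[w]
            for w in range((start - 1) // WINDOW, (end - 1) // WINDOW + 1)
            if w in windows]
    return min(hits) if hits else None


# Hypothetical coordinate-sorted records.
records = [("chr1", 1, 100), ("chr1", 16000, 16500),
           ("chr1", 40000, 40100), ("chr2", 5, 50)]
index = build_window_index(records)
print(first_candidate(index, "chr1", 16400, 16450))
```

The sketch relies on the records being sorted by position, which is why indexing requires sorted input.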

References

  1. "SAM tools". SourceForge.
  2. "Releases · samtools/samtools". github.com. Retrieved 2021-04-28.
  3. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. (August 2009). "The Sequence Alignment/Map format and SAMtools" (PDF). Bioinformatics. 25 (16): 2078–9. doi:10.1093/bioinformatics/btp352. PMC 2723002. PMID 19505943.
  4. Bonfield JK, Marshall J, Danecek P, Li H, Ohan V, Whitwham A, et al. (February 2021). "HTSlib: C library for reading/writing high-throughput sequencing data". GigaScience. 10 (2). doi:10.1093/gigascience/giab007. PMC 7931820. PMID 33594436.
  5. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, et al. (February 2021). "Twelve years of SAMtools and BCFtools". GigaScience. 10 (2). doi:10.1093/gigascience/giab008. PMC 7931819. PMID 33590861.
  6. IGV