a50: Compare genome assemblies

a50 - a simple tool for graphing genome coverage and fragmentation.

Reads files of contigs, and compares them by plotting each as a line in a graph. The x-axis represents contig number, the y-axis represents total (cumulative) size. An ideal assembly contains a few large contigs, so its curve should rise steeply and stop early, at the expected genome size. Conversely, a poor assembly consisting of many small fragments will have a less steep curve extending far to the right.

The graphs produced by a50 give a simple, easy-to-grasp comparison between assemblies, yet provide a more detailed and informative view than the usual metrics such as largest contig size or N50.

The Darcs repository is at http://malde.org/~ketil/biohaskell/a50.

Downloads

a50-0.4.tar.gz [browse] (Cabal source package)
Package description (as included in the package)

Versions [RSS]	0.2, 0.4, 0.5
Dependencies	base (>=3 && <5), bio (>0.4), cmdargs (>=0.5), containers, directory, process [details]
License	LicenseRef-GPL
Author	Ketil Malde
Maintainer	Ketil Malde <ketil@malde.org>
Category	Bioinformatics
Home page	http://blog.malde.org/index.php/a50-a-graphical-comparison-of-genome-assemblies
Uploaded	by KetilMalde at 2012-12-13T18:01:36Z
Distributions
Reverse Dependencies	1 direct, 0 indirect [details]
Executables	a50
Downloads	3832 total (13 in the last 30 days)
Status	Docs not available [build log] Successful builds reported [all 7 reports]

Readme for a50-0.4

a50 is a tool for comparing genome assemblies, providing a bit more
information than the usual numeric statistics, like N50.  For a quick
overview of the options, use 'a50 --help'.


  General usage
  -------------

To compare assemblies, you need two or more fasta-formatted files
containing contigs.  Running 'a50' with the files as arguments
produces a plot with one curve per input.  On the x-axis are the
contigs, ordered by size, and on the y-axis is the corresponding
cumulative size.  Generally, a better assembly has a steeper curve
(big contigs) which ends early (few contigs) near the y-value
corresponding to the expected genome size.
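
The curve described above can be reproduced by hand.  The following
Python sketch is not a50's own code; it assumes that "ordered by size"
means largest contig first, and the function names
read_contig_lengths and cumulative_curve are made up for illustration.

```python
from itertools import accumulate


def read_contig_lengths(path):
    """Return the length of every sequence in a FASTA file."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A header line starts a new contig; store the previous one.
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def cumulative_curve(lengths):
    """Return (contig number, cumulative size) points, largest contig first.

    Assumption: contigs are sorted by decreasing size, which gives the
    steep-then-flat shape described in the text.
    """
    ordered = sorted(lengths, reverse=True)
    return list(enumerate(accumulate(ordered), start=1))
```

For contigs of 100, 500 and 200 bases, cumulative_curve returns the
points (1, 500), (2, 700), (3, 800): the curve ends at x = 3 contigs
and y = 800 bases.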

The plot is generated using 'gnuplot', so the gnuplot executable must
be in your $PATH.  You can pass various settings to gnuplot on the
command line: use -f to specify the output format ('terminal' in
gnuplot lingo), -o to specify an output file, and -e to draw
horizontal lines, e.g. at the expected genome size.  (The example
below passes the format with -t; 'a50 --help' shows which flag your
version accepts.)

For example:

        a50 -t pdf -o asm.pdf asm1.fasta asm2.fasta -e 8e+8

will plot the assemblies given in asm1.fasta and asm2.fasta against an
expected genome size of 800Mb in PDF format to the file asm.pdf.  If
no output or format is specified, gnuplot will display the graph in a
window.  If an output file, but no format is specified, a50 will try
to determine the format from the file name extension.


  Using EST or transcripts as a reference
  ---------------------------------------

A similar way to compare assemblies is to measure the gene coverage of
each contig.  To do this, you need to have the genes available,
typically in the form of raw or assembled ESTs.  Specify the set of
transcripts using the -E option, e.g.:

        a50 asm.fasta asm2.fasta -E ests.fasta

This plots curves with the contigs of each assembly ordered by size
along the x-axis as before, but the y-value is now the total amount of
transcript data mapped to the contigs.

The mapping process runs BLAT and stores the resulting PSL files in a
temporary directory.  This is typically /tmp, but it can be overridden
with the $TMPDIR variable or specified with -T.  Only the best hit for
each EST is retained, so exons in other contigs are not counted.
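
To illustrate what "best hit for each EST" can mean in practice, here
is a minimal Python sketch that filters PSL alignment rows.  It is not
a50's implementation: it assumes that "best" means the largest number
of matching bases (PSL column 1), and the function names best_hits and
mapped_per_contig are hypothetical.

```python
from collections import defaultdict


def best_hits(psl_lines):
    """Keep, for each query (EST), the alignment with the most matches.

    Assumption: the PSL 'matches' column decides which hit is best;
    a50 may rank hits differently.
    """
    best = {}
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the PSL header and anything that is not a 21-column alignment row.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        matches = int(fields[0])  # column 1: number of matching bases
        query = fields[9]         # column 10: query (EST) name
        target = fields[13]       # column 14: target (contig) name
        if query not in best or matches > best[query][0]:
            best[query] = (matches, target)
    return best


def mapped_per_contig(psl_lines):
    """Sum the matching bases of the retained hits for each contig."""
    totals = defaultdict(int)
    for matches, target in best_hits(psl_lines).values():
        totals[target] += matches
    return dict(totals)
```

With this filter, an EST whose exons are split between two contigs
contributes only to the contig holding its best alignment, which is
the effect described above.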
 

  Comparison to other measures
  ----------------------------

You can read various other measures off the graph fairly easily.
Total assembly size is the y-value at which the curve ends, and the
number of contigs in the assembly is the corresponding x-value.  An
assembly with greater average contig size will end to the left of
and/or above an assembly with shorter average contigs.  N50 is the
slope of the graph (that is, the size of the contig) at the y-value
corresponding to half the total assembly size; using half the expected
genome size instead gives the related NG50.
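
The same measures can be computed directly from the contig lengths.
The sketch below is illustrative Python, not part of a50; it uses the
usual definition of N50 (the size of the contig at which the
cumulative size, largest contig first, reaches half of the total
assembly size), and the name assembly_summary is made up.

```python
def assembly_summary(lengths):
    """Return contig count, total size, mean contig size and N50."""
    ordered = sorted(lengths, reverse=True)  # largest contig first
    total = sum(ordered)
    n50 = 0
    running = 0
    for size in ordered:
        running += size
        # The first contig that takes the cumulative size to at least
        # half of the total defines N50.
        if 2 * running >= total:
            n50 = size
            break
    return {
        "contigs": len(ordered),        # x-value where the curve ends
        "total_size": total,            # y-value where the curve ends
        "mean_size": total / len(ordered) if ordered else 0,
        "n50": n50,
    }
```

For contigs of 80, 70, 50, 40, 30 and 20 bases the total is 290, the
cumulative size passes 145 at the second contig, and N50 is 70.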