# The a50 package

[ Tags: bioinformatics, gpl, program ]

a50 - a simple tool for graphing genome coverage and fragmentation.

Reads files of contigs and compares them by plotting each as a line in a graph. The x-axis represents contig number, the y-axis represents total (cumulative) size. An ideal assembly contains a few large contigs, so its curve should rise steeply and stop early (but at the expected genome size). Conversely, a poor assembly consisting of many small fragments will have a less steep curve extending far to the right.

The graphs produced by a50 give a simple and easy-to-grasp comparison between assemblies, and yet provide a more detailed and informative view than the usual metrics like largest contig size or N50.

The Darcs repository is at http://malde.org/~ketil/biohaskell/a50.

## Properties

| Property | Value |
|---|---|
| Versions | 0.2, 0.4, 0.5 |
| Dependencies | base (>=3 && <5), biofasta, biopsl, cmdargs (>=0.5), containers, directory, process |
| License | GPL |
| Author | Ketil Malde |
| Maintainer | Ketil Malde |
| Category | Bioinformatics |
| Home page | http://blog.malde.org/index.php/a50-a-graphical-comparison-of-genome-assemblies |
| Uploaded | Tue Jul 23 20:36:12 UTC 2013 by KetilMalde |
| Distributions | NixOS:0.5 |
| Executables | a50 |
| Downloads | 1462 total (21 in the last 30 days) |
| Rating | (no votes yet) [estimated by rule of succession] |
| Status | Docs not available; last success reported on 2016-10-26 |

a50 is a tool for comparing genome assemblies, providing a bit more
information than the usual numeric statistics, like N50.  For a quick
overview of the options, use 'a50 --help'.

General usage
-------------

To compare assemblies, you need two or more fasta-formatted files
containing contigs.  Running 'a50' with the files as arguments
produces a plot with one curve per input.  On the x-axis are the
contigs, ordered by size, and on the y-axis is the corresponding
cumulative size.  Generally, a better assembly has a steeper curve
(big contigs) which ends early (few contigs) near the y-value
corresponding to the expected genome size.
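
The curve described above can be sketched in a few lines of Python.
This is only an illustration of the idea, not a50's own code: the
function name `cumulative_curve` is hypothetical, and it assumes that
"ordered by size" means largest contig first, which is what makes a
good assembly rise steeply.

```python
def cumulative_curve(contig_lengths):
    """Return (rank, cumulative size) points for one assembly.

    Assumption: contigs are ordered from largest to smallest.
    """
    ordered = sorted(contig_lengths, reverse=True)
    points = []
    total = 0
    for rank, length in enumerate(ordered, start=1):
        total += length
        points.append((rank, total))
    return points


# Two toy assemblies of the same total size (1000 bases).
good = cumulative_curve([500, 300, 200])   # few, large contigs
poor = cumulative_curve([100] * 10)        # many small fragments
```

Both toy curves end at the same height, but the first reaches it
after three contigs and the second only after ten.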

The plot is generated using 'gnuplot', so the gnuplot executable must
be in your $PATH.  You can pass various information to gnuplot on the
command line: specifically, you can use -f to specify the output
format ('terminal' in gnuplot lingo), -o to specify an output file,
and -e to produce horizontal lines, e.g. at the expected genome size.
For example:

    a50 -t pdf -o asm.pdf asm1.fasta asm2.fasta -e 8e+8

will plot the assemblies given in asm1.fasta and asm2.fasta against an
expected genome size of 800Mb in PDF format to the file asm.pdf.

If no output or format is specified, gnuplot will display the graph in
a window.  If an output file but no format is specified, a50 will try
to determine the format from the file name extension.

Using EST or transcripts as a reference
---------------------------------------

A similar way to compare assemblies is to measure the gene coverage of
each contig.  To do this, you need to have the genes available,
typically in the form of raw or assembled ESTs.  Specifying a set of
transcripts using the -E option, e.g.:

    a50 asm.fasta asm2.fasta -E ests.fasta

will plot curves with contigs of the assemblies ordered by size along
the x-axis as before, but the y-value will be the total amount of
transcript data mapped to the contigs.

The mapping process runs BLAT and stores the resulting PSL files in a
temporary directory, typically /tmp, which can be overridden with the
$TMPDIR variable or specified with -T.  Only the best hit for each EST
is retained, so exons in other contigs are not counted.
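
The "best hit only" rule can be illustrated with a small Python
sketch.  It is not taken from a50: the function name
`transcript_coverage` and the (EST, contig, matched bases) tuples are
hypothetical, and ranking hits by number of matched bases is an
assumption, since the criterion a50 uses to pick the best hit is not
stated here.

```python
from collections import defaultdict


def transcript_coverage(hits):
    """Sum transcript data per contig, keeping one hit per EST.

    hits: iterable of (est_id, contig_id, matched_bases) tuples.
    Assumption: the "best" hit is the one with the most matched bases.
    """
    best = {}
    for est, contig, matched in hits:
        if est not in best or matched > best[est][1]:
            best[est] = (contig, matched)

    per_contig = defaultdict(int)
    for contig, matched in best.values():
        per_contig[contig] += matched
    return dict(per_contig)


hits = [
    ("est1", "contig1", 400),  # best hit for est1
    ("est1", "contig2", 150),  # discarded: est1 already has a better hit
    ("est2", "contig2", 300),
]
coverage = transcript_coverage(hits)
```

In this toy input the 150 bases of est1 that align to contig2 are
dropped, which is the effect described above for exons lying in other
contigs.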

Comparison to other measures
----------------------------

You can read various other measures off the graph fairly easily.
Total assembly size is the y-value at which the curve ends, and the
number of contigs in the assembly is the corresponding x-value.  An
assembly with greater average contig size will end to the left of
and/or above an assembly with shorter average contigs.  N50 is the
slope of the graph (that is, the size of the contig reached) at the
y-value corresponding to half the total assembly size.
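
For comparison, the same numbers can be computed directly from the
contig lengths.  The sketch below is an illustration rather than part
of a50; the function name `assembly_stats` is hypothetical, and it
uses the common definition of N50 based on half the total assembly
size.

```python
def assembly_stats(contig_lengths):
    """Return contig count, total size and N50 for one assembly."""
    ordered = sorted(contig_lengths, reverse=True)
    total = sum(ordered)

    # N50: size of the contig at which the cumulative size first
    # reaches half of the total assembly size.
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            n50 = length
            break

    return {"contigs": len(ordered), "total": total, "n50": n50}


stats = assembly_stats([400, 300, 200, 100])
```

Here the curve would end at x = 4 and y = 1000, and the cumulative
size crosses 500 while adding the 300-base contig, so that contig's
size is the N50.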