xml2x: Convert BLAST output in XML format to CSV or HTML

[ bioinformatics, program ]

xml2x converts BLAST output in XML format either to a CSV table (suitable for importing into e.g. Excel or OOCalc) or to HTML. It can optionally annotate the output with GO terms.

Downloads

xml2x-0.4.tar.gz (Cabal source package)

Package maintainers

GwernBranwen, KetilMalde

Versions	0.2, 0.4, 0.4.1, 0.4.2
Dependencies	array, base (>3), bio (>=0.3.4), bytestring, containers, directory, xhtml
License	LicenseRef-GPL
Author	Ketil Malde
Maintainer	Ketil Malde <ketil@malde.org>
Category	Bioinformatics
Uploaded	by KetilMalde at 2009-01-30T20:11:22Z
Reverse Dependencies	1 direct, 0 indirect
Executables	xml2x
Downloads	3747 total (11 in the last 30 days)
Rating	(no votes yet)
Status	Docs not available. All reported builds failed as of 2017-01-02 (8 reports).

Readme for xml2x-0.4

SYNOPSIS

    xml2x - convert BLAST output in XML format, either to a (CSV)
            table suitable for e.g. importing into Excel or OOCalc, or
            to HTML.  Optionally, annotate the output with GO terms
            and KEGG KOs.

INSTALLATION

    Install with the usual Cabal routine.  It should also be possible
    to compile via the Makefile.

USAGE

    xml2x [options] xmlfile1 xmlfile2...

    Use -v if you are on an interactive terminal to keep track of
    progress.

    Output format is specified with -C (CSV) or -H (HTML), with -C
    being the default.  Note that only one output format can be used
    at a time.

CSV OUTPUT

    For CSV output, the following modes are supported:

      --all    - output all BLAST matches (HSPs), one per line
      --top    - output only the top hit for each input sequence
      --region - output the top hit for each region, where regions
                 overlap by less than 50%

    Use -o to specify an output file, the default is to output to
    standard out.
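
    To make the --region mode more concrete, here is a minimal Python
    sketch of one plausible reading of "top hit for regions that
    overlap <50%".  It is not taken from the xml2x source: the Hit
    record, the greedy best-score-first strategy, and the choice to
    measure overlap relative to the shorter hit are all assumptions
    made for illustration.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one HSP; not an xml2x data type."""
    subject: str   # name of the database sequence that was hit
    start: int     # first query position covered (1-based, inclusive)
    end: int       # last query position covered (inclusive)
    score: float   # bit score of the HSP


def overlap_fraction(a: Hit, b: Hit) -> float:
    """Shared query positions, as a fraction of the shorter hit's length."""
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    shorter = min(a.end - a.start, b.end - b.start) + 1
    return max(shared, 0) / shorter


def top_hits_per_region(hits: List[Hit], max_overlap: float = 0.5) -> List[Hit]:
    """Greedily keep the best-scoring hits that overlap every
    previously kept hit by less than max_overlap."""
    kept: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if all(overlap_fraction(hit, k) < max_overlap for k in kept):
            kept.append(hit)
    # Report the surviving hits in query order.
    return sorted(kept, key=lambda h: h.start)
```

    Under these assumptions, a lower-scoring hit that lies entirely
    inside a better hit is dropped, while a hit that mostly covers a
    different part of the query is reported as the top hit of its own
    region.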

HTML OUTPUT

    For HTML output, a directory called "blast.d" is created (or
    re-used if already present), and an index is constructed in a file
    named "index.html" in the current directory.  The index lists some
    information about the highest scoring blast hit, and links to the
    file displaying the alignment.

    The directory contains one HTML file per input sequence, and each
    file uses an HTML table to render the alignments.  Color codes
    indicate level of identity (not total match score or E-value!),
    so short, bright red matches may have a lower score than long
    gray ones.  Frame (for BLASTX) or strand (for BLASTN) is
    indicated as text for each match.
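
    The difference between identity and score can be illustrated with
    a small Python sketch.  The gray-to-red gradient and both function
    names are hypothetical; the actual colors xml2x emits are not
    documented here.  The point is only that the color depends on the
    fraction of identical positions, not on the alignment length.

```python
def identity_fraction(identities: int, align_len: int) -> float:
    """Fraction of aligned positions that are identical."""
    return identities / align_len


def identity_color(fraction: float) -> str:
    """Blend linearly from gray (#808080) at 0% identity to red
    (#ff0000) at 100% identity.  Assumed palette, for illustration."""
    f = min(max(fraction, 0.0), 1.0)
    red = round(128 + f * (255 - 128))
    other = round(128 * (1.0 - f))
    return f"#{red:02x}{other:02x}{other:02x}"


# A short perfect match is colored more strongly than a long, weaker
# one, even though the long match would normally have the higher score.
short_match = identity_color(identity_fraction(30, 30))
long_match = identity_color(identity_fraction(300, 600))
```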

    The files are named consistently, so if you run BLAST in both
    directions (i.e.  swapping -i and -d), you should be able to go
    back and forth by clicking on the sequence names.

ANNOTATIONS

    Use --annotations to specify the mapping between UniProt
    accessions and GO terms.  This file is usually called
    "gene_association.goa_uniprot", and is available from the GO
    consortium [1].  The file is several GB, so you may want to
    consider trimming it down a bit by filtering out the automatic
    (IEA) annotations.  This is optional, however: xml2x first scans
    the BLAST output to extract only the relevant GO annotations, so
    keeping the whole file in memory is not necessary.
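
    If you do want to trim the annotation file, a filter along the
    following lines may be enough.  This is a sketch and not part of
    xml2x: it assumes the file follows the tab-separated GAF layout,
    with the accession in column 2 and the evidence code in column 7,
    and with comment lines starting with "!".  The function names are
    made up for this example.  Passing a set of accessions mimics the
    "keep only what the BLAST output mentions" idea described above.

```python
from typing import Iterable, Iterator, Optional, Set


def keep_annotation(line: str,
                    accessions: Optional[Set[str]] = None,
                    drop_iea: bool = True) -> bool:
    """Decide whether one line of a GAF-style file should be kept."""
    if line.startswith("!"):
        return True                     # header and comment lines
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 7:
        return False                    # malformed line, skip it
    accession, evidence = fields[1], fields[6]
    if drop_iea and evidence == "IEA":
        return False                    # automatic annotation
    return accessions is None or accession in accessions


def filter_annotations(lines: Iterable[str],
                       accessions: Optional[Set[str]] = None,
                       drop_iea: bool = True) -> Iterator[str]:
    """Lazily yield the lines to keep, so the input is streamed."""
    return (l for l in lines if keep_annotation(l, accessions, drop_iea))


def trim_file(src: str, dst: str) -> None:
    """Write a copy of src without IEA annotations to dst."""
    with open(src) as fin, open(dst, "w") as fout:
        fout.writelines(filter_annotations(fin))
```

    Because the filter is a generator over lines, the multi-GB input
    is never held in memory as a whole.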

    Additionally, you can use --ontology to specify the description of
    the GO terms, and the output will then be somewhat more
    meaningful.  The file is usually called "gene_ontology.obo",
    similarly available [2].

    You can also add KEGG annotations with the -k (or --kegg-organism)
    option.  This option takes a file prefix as a parameter and, for
    a prefix $P, expects to find the files $P_uniprot.list and
    $P_ko.list.  These files are read and used to map KEGG KOs to
    each UniProt hit.  They are available from [3].
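
    The mapping step can be pictured as a join of the two files on the
    KEGG gene identifier.  The Python sketch below is illustrative
    only and does not reproduce xml2x's own code.  It assumes that
    both files have two whitespace-separated columns, a KEGG gene ID
    followed by a prefixed identifier such as "up:<accession>" or
    "ko:<KO number>"; check this against the files you download.  The
    function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Set, Tuple


def read_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (gene, identifier) pairs, dropping any 'up:'/'ko:' prefix."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            yield fields[0], fields[1].split(":", 1)[-1]


def uniprot_to_kos(uniprot_lines: Iterable[str],
                   ko_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Join the two lists on the gene ID: accession -> sorted KOs."""
    gene_kos: Dict[str, Set[str]] = defaultdict(set)
    for gene, ko in read_pairs(ko_lines):
        gene_kos[gene].add(ko)
    acc_kos: Dict[str, Set[str]] = defaultdict(set)
    for gene, accession in read_pairs(uniprot_lines):
        acc_kos[accession] |= gene_kos.get(gene, set())
    # Accessions whose gene has no KO assignment are left out.
    return {acc: sorted(kos) for acc, kos in acc_kos.items() if kos}


def load_kegg(prefix: str) -> Dict[str, List[str]]:
    """Read <prefix>_uniprot.list and <prefix>_ko.list, as named above."""
    with open(prefix + "_uniprot.list") as up, open(prefix + "_ko.list") as ko:
        return uniprot_to_kos(up, ko)
```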

BUGS

    XML parsing is slow, but ndm said he'd look into it.

    xml2x must be compiled with -smp to avoid huge memory
    requirements.  On the plus side, with -smp it uses a lot less RAM
    than AutoFact.

REFERENCES

    [1] ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/
    [2] http://www.geneontology.org/ontology/gene_ontology.obo
    [3] ftp://ftp.genome.jp/pub/kegg/genes/organisms/