Haskell XML Toolbox 8.3.2

Introduction
Description
Documentation
Requirements
Download
Installation
Change history
Known problems and limitations
Portability
HXT with Filters
Related Work
Feedback

Introduction

The Haskell XML Toolbox is a collection of tools for processing XML with Haskell. It is written purely in Haskell. The Haskell XML Toolbox is a project of the University of Applied Sciences Wedel.

The main design goal of the Haskell XML Toolbox is the support of various XML standards including Extensible Markup Language (XML) 1.0 (Second Edition) with DTD processing and Validation, Namespaces in XML 1.0 (Second Edition), XML Path Language (XPath), XSL Transformations (XSLT), RELAX NG Specification, as well as HTML/XHTML processing.

Description

The Haskell XML Toolbox is based on the ideas of HaXml and HXML, but introduces a more general and flexible approach for processing XML with Haskell. The Haskell XML Toolbox uses a generic data model for representing XML documents, including the DTD subset and the document subset, in Haskell. This data model makes it possible to use filter functions as a uniform design for XML processing applications. The processing filters are implemented as arrows. This is more flexible than the filter approach of HXML and HaXml, and all filter applications can easily be transformed into arrows.
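The relation between filters and arrows can be illustrated with a minimal sketch that uses only the base library. The Tree type and the names children, hasName, liftFilter and itemContents are hypothetical and chosen for this illustration; they are not the HXT data model or API. A filter is modeled as a function from a node to a list of result nodes, and wrapping it in Kleisli [] turns it into an arrow that composes with (>>>).

```haskell
import Control.Arrow (Kleisli (..), (>>>))

-- A deliberately tiny stand-in for an XML tree (not HXT's own tree type).
data Tree = Elem String [Tree] | Text String
  deriving (Eq, Show)

-- A filter maps one node to a list of results; [] means "no match".
type Filter = Tree -> [Tree]

-- The same idea as an arrow: Kleisli [] is an Arrow instance from base.
type ListArrow = Kleisli [] Tree Tree

-- All direct children of an element node.
children :: Filter
children (Elem _ cs) = cs
children _           = []

-- Keep the node only if it is an element with the given name.
hasName :: String -> Filter
hasName n t@(Elem m _) | n == m = [t]
hasName _ _                     = []

-- Any filter can be lifted into the arrow world unchanged.
liftFilter :: Filter -> ListArrow
liftFilter = Kleisli

-- Select the contents of all "item" children of the input node.
itemContents :: ListArrow
itemContents =
  liftFilter children >>> liftFilter (hasName "item") >>> liftFilter children

doc :: Tree
doc = Elem "list"
  [ Elem "item" [Text "a"]
  , Elem "note" [Text "x"]
  , Elem "item" [Text "b"]
  ]

main :: IO ()
main = print (runKleisli itemContents doc)  -- prints [Text "a",Text "b"]
```

Because lifting is just a constructor application, every program written against the filter type carries over to the arrow type, while the arrow type leaves room for other arrow instances (for example with state or IO) that a plain function type cannot express.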

Since version 5.2 HXT works with arrows instead of filters. The filter part has been separated from this library and is available in an extra package (see HXT with Filters). There is a cookbook for using this arrow interface to build (nontrivial) applications. Manuel Ohlendorf has developed examples for processing RDF and has documented the development in his master's thesis: A Cookbook for the Haskell XML Toolbox with Examples for Processing RDF Documents (the thesis as PDF).

Features:

Unicode and UTF-8, US-ASCII and ISO-Latin-1 support
http: and file: protocol support
http access via proxy
well-formed document parsing, validation
namespace support: namespace propagation and checking
XPath support for selection of document parts
liberal HTML parser for interpreting any text containing < ... > as HTML/XML
liberal and lazy lightweight HTML/XML parser based on tagsoup
Relax NG schema validator
integrated XSLT transformer
easy conversion between user defined data structures and XML by the use of pickler functions
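
The pickler idea from the last feature can be sketched in a few lines. This is a simplified model under stated assumptions, not the HXT pickler API: the PU type and the names puText, puInt, puPair, puWrap and puPerson are hypothetical, and the target format is a flat list of string tokens instead of an XML tree. The point it shows is that one pickler value describes both directions of the conversion, so encoder and decoder cannot drift apart.

```haskell
-- A pickler pairs an encoder with a decoder over a shared token stream.
data PU a = PU
  { pickle   :: a -> [String] -> [String]
  , unpickle :: [String] -> Maybe (a, [String])
  }

-- A single text token.
puText :: PU String
puText = PU (:) step
  where
    step (t : ts) = Just (t, ts)
    step []       = Nothing

-- An Int stored as its decimal representation.
puInt :: PU Int
puInt = PU (\n ts -> show n : ts) step
  where
    step (t : ts) | [(n, "")] <- reads t = Just (n, ts)
    step _                               = Nothing

-- Two values in sequence: first component, then second component.
puPair :: PU a -> PU b -> PU (a, b)
puPair pa pb = PU enc dec
  where
    enc (a, b) = pickle pa a . pickle pb b
    dec ts = do
      (a, ts1) <- unpickle pa ts
      (b, ts2) <- unpickle pb ts1
      Just ((a, b), ts2)

-- Reuse an existing pickler for an isomorphic type.
puWrap :: (a -> b, b -> a) -> PU a -> PU b
puWrap (to, from) pa =
  PU (pickle pa . from) (fmap (\(a, ts) -> (to a, ts)) . unpickle pa)

-- A user defined data type and its pickler.
data Person = Person { name :: String, age :: Int }
  deriving (Eq, Show)

puPerson :: PU Person
puPerson = puWrap (uncurry Person, \p -> (name p, age p)) (puPair puText puInt)

main :: IO ()
main = do
  let tokens = pickle puPerson (Person "Ada" 36) []
  print tokens                                -- prints ["Ada","36"]
  print (fmap fst (unpickle puPerson tokens)) -- round trip back to a Person
```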

Documentation

The HXT API documentation is generated with Haddock.

A (somewhat) gentle introduction to HXT is available in the Haskell Wiki. There's also a page about HXT: Conversion of Haskell data from/to XML with picklers.

The XSLT transformer has been developed by Tim Walkenhorst in his master's thesis: Implementing an XSLT processor for the Haskell XML Toolbox. It's a rather complete implementation, but it's of course not a substitute for Xalan or other advanced XSLT systems. The XSLT module consists of fewer than 2000 lines of code. Compared with the more than 300,000 lines of Java for Xalan, this Haskell code can be viewed as one of the first formal specifications for XSLT.

Manuel Ohlendorf's master's thesis, describing the arrow interface of the toolbox: A Cookbook for the Haskell XML Toolbox with Examples for Processing RDF Documents (the thesis as PDF). The source code of the example application is included in the doc/cookbook directory of the distribution.

The master's thesis "Design and Implementation of a validating XML parser in Haskell" by Martin Schmidt describes the design and motivation of the Haskell XML Toolbox (the thesis as HTML or PDF) and the development of the DTD validator module. The documentation in the thesis is a bit out of date: modules, module names and some function names have been changed. For details, the online Haddock documentation should be used.

The development of the XPath modules is described (in German) in Konzeption und Implementierung eines XPath-Moduls für die Haskell XML Toolbox (PDF document).

The internals of the Relax NG validator modules are described (in German) in Design und Entwicklung eines Relax NG Schema Validators auf Basis der Haskell XML Toolbox (PDF document).

Requirements

It is recommended to install the versions available from Hackage.

Download

Haskell XML Toolbox 8.3.2, released 2009-10-29:

This version works with ghc 6.10 with cabal 1.6. For ghc 6.8 and cabal < 1.6 please use HXT 8.1.1.
This version does not contain the filter part of hxt any more. That functionality is separated into a package hxt-filter.
Includes sources for building a ghc package hxt with Cabal or make. This package contains a Haskell DOM, an XML parser, an HTML parser based on parsec, a lightweight HTML/XML parser based on tagsoup, a DTD validator, namespace processing functions, a Relax NG validator, an XPath expression evaluator, an XSLT transformer, and serialization/deserialization of data to/from XML.
HTTP access is done via the curl binding.
Includes various examples, e.g. in example dir examples/arrows/hparser/ a validating parser, which can be used as a starting point for an HXT command line application.
Includes an arrow interface with type classes and overloading for a more flexible use of the filter technique.

A darcs repository is available at http://darcs2.fh-wedel.de/repos/hxt. The web interface is http://darcs2.fh-wedel.de/cgi-bin/darcsweb.cgi.

Installation

Before installing this version, install the curl and tagsoup modules. For a quick install with Cabal, execute the following commands in .../HXT-8.3.2:

    cabal configure
    cabal build
    sudo cabal install --global

A quick test of the example programs:

    cd examples
    make all
    make test

Installation without Cabal with GNU make:

    make all
    make install               # with root privileges

Change History

In Version 8.3.2 (this is a bug fix release)
- New output option a_output_xhtml for writing XHTML.
- New output option a_no_empty_elem_for for precise control over which empty elements shall not be emitted in short form <name .../>.
- New output option a_add_default_dtd for easily adding a Document Type Declaration.
- writeDocumentToString changed, such that it is a pure arrow and does not need to run in the IO monad.
- Handling of URIs containing unescaped characters changed. When a URI can't be parsed (with Network.URI), the disallowed characters are escaped in %XX format and URI parsing is retried. This enables normal file names to contain blanks and other characters without explicit escaping.
In Version 8.3.1
- Additional input option "a_accept_mimetypes" for setting a list of allowed mime types when using readDocument.
In Version 8.3.1 (this is only a bug fix release)
- Interface and option handling for libcurl reworked. New input option "a_no_redirect" for preventing automatic redirects added.
- Encoding of non-XML/HTML text data is done with the same encoding routines as for XML/HTML. This enables easy processing of other text documents.
In Version 8.3.0
- New output option "a_no_empty_elements" for preventing the XML short format "<name/>" for HTML elements, like "script", "p", and others. Especially a script tag of the form "<script href="..."/>" does not work in Firefox. Turning on this option gives the form "<script href="..."></script>".
- An input option "a_strict_input" for bytestring input of files added. Lazy input, especially when using the tagsoup parser, can lead to error messages like "too many open files" when processing a whole bunch of documents.
- Internal representation of qualified names changed to gain more space efficiency.
In Version 8.2.0
- Modifications to work with ghc-6.10.
- A new module Data.Atom for dealing with names like LISP atoms and for sharing the memory for these values. When using names as keys in tables, trees or maps, it becomes much more efficient to represent these names as atoms than as strings. Equality checks on atoms take constant time and are really fast, and all occurrences of an atom share the same internal value.
- Implementation of strictA changed. strictA is marked deprecated. The implementation is no longer done with a DeepSeq function but with Control.Parallel.Strategies, the NFData class and rnf. There is a new combinator rnfA for complete deep evaluation of an arrow result.
- Further functions for working with W3C XML Schema Regular expressions in module Text.XML.HXT.RelaxNG.XmlSchema.RegexMatch, especially for tokenizing and sed-like editing of text.
In Version 8.1.1
- New functions for working with W3C XML Schema Regular expressions in module Text.XML.HXT.RelaxNG.XmlSchema.RegexMatch.
- Darcs server has changed, new server is http://darcs2.fh-wedel.de/repos/hxt/.
In Version 8.1.0
- HTTP interface changed to work with libcurl via the curl bindings in package curl. The HTTP package is therefore no longer needed, and neither is the old and somewhat inefficient interface to curl, which started an external process and communicated via a pipe. When installing the curl bindings, be aware that the libcurl development packages, including the C header files, must be installed. Otherwise Setup.hs will complain about missing files.
- New input option for ignoring non-XML/HTML contents when reading documents (useful for crawler-like applications).
- Mime type support for the file: protocol. Mime type mapping can be controlled by a config file in the format of /etc/mime.types on some Linux systems.
- A new option for ignoring decoding errors when reading XML documents. This may be useful for crawler-like applications.
In Version 8.0.0
- Old filter interface separated from the hxt package and moved to an extra package hxt-filter.
- Version numbers added in hxt.cabal for required package versions.
- DTD validation and XPath modules refactored to work with arrows instead of filters. This is done for separating the old hxt filter library from the actively developed and maintained arrow part.
In Version 7.5
- A module Text.XML.HXT.Arrow.XPathSimple for fast XPath selection for simple XPath queries added. If the XPath query navigates only along axes leading from the root down to the leaves, the query is evaluated by computing a simple arrow and applying that arrow. This gives a speedup for queries like /html/body//h1/text() for extracting the headline text of an HTML document.
- Option added to control the selection of the parser (XML/HTML) by the mime type of the document.
- New lazy and lightweight HTML/XML parser based on the tagsoup library. Useful especially for converting XML into native Haskell data. (Example in examples/arrows/performance)
- W3C XML Schema Datatype library for validating with RelaxNG extended. Currently supported Datatypes: Strings, URIs, QNames, binary string-encoded, decimal and all integer datatypes.
- Regular expression pattern matcher for W3C XML Schema datatype patterns.
- New arrow combinator 'mergeA' in Control.Arrow.ArrowList for combining the components of a tuple, resulting from applying arrows constructed with (&&&), (***) and similar combinators.
In Version 7.4
- Configuration changes for ghc-6.8
- Error in DTD validation algorithm resulting in exponential growth of runtime during validating mixed content elements removed.
In Version 7.3
- Module Text.XML.HXT.Pickle for conversion between user defined data types and the HXT DOM structure extended. The picklers are extended such that a DTD can be derived from a pickler and can be checked for consistency. See Haddock doc (Text.XML.HXT.Pickle) and example directory examples/arrows/pickle for examples.
In Version 7.2
- New module Text.XML.HXT.XmlPickler for conversion between user defined data types and the HXT DOM structure. This enables the simple persistent storage and retrieval of arbitrary data with XML documents. See example directory examples/arrows/pickle for a nontrivial example of these picklers. These functions are an adaptation of Andrew Kennedy's pickler combinators.
- UTF-8 decoding done with UTF8 module from darcs. Decoding errors are detected and issued. US-ASCII decoding also checks encoding errors.
- ISO-8859-X (X=2..11,13..16) decoding of input documents implemented.
- Some bug fixes.
In Version 7.1
- Version control changed from CVS to darcs. darcs repository is available under http://darcs2.fh-wedel.de/repos/hxt/.
- deepSeq and strict added for XmlTree and an arrow strictA for forcing the evaluation of a whole XML tree. This sometimes saves space when applied after document input, DTD processing and validation.
- Typeable instances added for all DOM data types.
- HTTP access via curl extended to handle automatic redirects
In Version 7.0
- New integrated XSLT transformer module. The example parser (examples/arrows/hparser) is extended to act as an XSLT transformer.
- Errors concerning HTML parsing and implicit closing of elements have been fixed.
- Some minor changes of and additions to the arrow API.
- New functions readString and readFromString for reading documents from Haskell strings the same way as reading external documents with readDocument.
In Version 6.1
- HXT 6.1 contains only changes in the cabal installation process for working with the Haskell HTTP module in tar archive http-20060707.tgz. This module no longer depends on the Haskell Cryptographic Library and the NewBinary package.
In Version 6.0
- The arrow interface has changed slightly, especially the handling of user defined states in the state and IO arrows has been simplified. This is the main reason for a 6.0 version.
- The XPath arrows have been extended. There are arrows not only for selecting nodes via an XPath expression, but also for processing and modifying all nodes selected by an XPath expression (see module Text.XML.HXT.Arrow.XmlNodeSet).
- Separation of the API documentation into two documents, one for the old Filter API and a separate one for the Arrow API. The complete API documentation is still available.
- DTD processing for the arrow part is done completely by arrow based routines.
- Cabal config file and dependencies changed to work with ghc 6.4.2.
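
The %XX escaping mentioned in the 8.3.2 entry on URIs with unescaped characters can be sketched as follows. This is an illustration with the base library only, not the code used by HXT or Network.URI. The names allowed, utf8, percent and escapeDisallowed are hypothetical, the set of allowed characters is an assumption, and so is the choice to escape non-ASCII characters as their UTF-8 bytes. The sketch covers only the escaping step, not the parse-and-retry logic.

```haskell
import Data.Char (isAlphaNum, isAscii, ord, toUpper)
import Numeric (showHex)

-- Characters left untouched in this sketch: ASCII letters, digits and common
-- URI punctuation. '%' is kept so that existing escapes survive (assumption).
allowed :: Char -> Bool
allowed c = isAscii c && (isAlphaNum c || c `elem` "-._~/:?#[]@!$&'()*+,;=%")

-- UTF-8 byte values of a single character.
utf8 :: Char -> [Int]
utf8 c
  | n < 0x80    = [n]
  | n < 0x800   = [0xC0 + n `div` 64, 0x80 + n `mod` 64]
  | n < 0x10000 = [ 0xE0 + n `div` 4096
                  , 0x80 + (n `div` 64) `mod` 64
                  , 0x80 + n `mod` 64 ]
  | otherwise   = [ 0xF0 + n `div` 262144
                  , 0x80 + (n `div` 4096) `mod` 64
                  , 0x80 + (n `div` 64) `mod` 64
                  , 0x80 + n `mod` 64 ]
  where
    n = ord c

-- One byte in %XX form with two upper case hex digits.
percent :: Int -> String
percent b = '%' : pad (map toUpper (showHex b ""))
  where
    pad [d] = ['0', d]
    pad ds  = ds

-- Escape every character that is not in the allowed set.
escapeDisallowed :: String -> String
escapeDisallowed = concatMap escape
  where
    escape c
      | allowed c = [c]
      | otherwise = concatMap percent (utf8 c)

main :: IO ()
main = putStrLn (escapeDisallowed "file:///tmp/my report.xml")
-- prints file:///tmp/my%20report.xml
```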

Known problems and limitations

The parser has been tested with the XML Validation Suite from the W3C. The following problems have been encountered:

Line numbers in XML parser do not always point to the correct position of the syntax error.
Line numbers are not yet reported for validation constraint errors.
The standalone document check is not yet implemented.
The XSLT module does not support the complete XSLT standard.

Portability

Portability to Windows-based systems has not been tested very intensively, but did work on an XP system with the Cygwin tools installed. Development was done under Linux with GHC 6.10 with the -Wall flag. No warnings were issued when compiling the toolbox sources.

HXT with Filters

For older applications using the filter functionality, there is an extra package hxt-filter. This package must be installed on top of hxt. The filter package will not be actively developed any more. Please move to the arrow version for long term projects. Installation works with cabal in the usual way. Download archive is hxt-filter-8.3.0.tar.gz. HXT Filter API Documentation with source links is available, as well as a darcs repository under http://darcs2.fh-wedel.de/repos/hxt-filter.

Related work

Malcolm Wallace and Colin Runciman wrote HaXml, a collection of utilities for using Haskell and XML together. The Haskell XML Toolbox is based on their idea of using filter combinators for processing XML with Haskell.
Joe English wrote HXML, a non-validating XML parser in Haskell. His idea of validating XML by using derivatives of regular expressions was implemented in the validation functions of this software. His ideas and sources for navigable trees are also used in the hxpath modules.
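
Validation with derivatives of regular expressions can be sketched as follows. This is a textbook-style illustration, not the code of the HXT validator: the RE type and the names nullable, deriv, matches and contentModel are hypothetical. A content model is treated as a regular expression over element names, the derivative with respect to a name is the expression that must match the remaining children, and a child sequence is valid if the final expression accepts the empty sequence.

```haskell
-- Regular expressions over element names.
data RE
  = Empty        -- matches nothing
  | Eps          -- matches the empty sequence
  | Sym String   -- exactly one element with the given name
  | Seq RE RE    -- first expression followed by the second
  | Alt RE RE    -- either expression
  | Star RE      -- zero or more repetitions
  deriving (Eq, Show)

-- Does the expression accept the empty sequence?
nullable :: RE -> Bool
nullable Empty     = False
nullable Eps       = True
nullable (Sym _)   = False
nullable (Seq a b) = nullable a && nullable b
nullable (Alt a b) = nullable a || nullable b
nullable (Star _)  = True

-- Derivative: what remains to be matched after consuming element x.
deriv :: String -> RE -> RE
deriv _ Empty     = Empty
deriv _ Eps       = Empty
deriv x (Sym y)   = if x == y then Eps else Empty
deriv x (Seq a b)
  | nullable a    = Alt (Seq (deriv x a) b) (deriv x b)
  | otherwise     = Seq (deriv x a) b
deriv x (Alt a b) = Alt (deriv x a) (deriv x b)
deriv x (Star a)  = Seq (deriv x a) (Star a)

-- A sequence of child names is valid if the derivative by all names is nullable.
matches :: RE -> [String] -> Bool
matches re = nullable . foldl (flip deriv) re

-- Content model corresponding to the DTD declaration (title, para*).
contentModel :: RE
contentModel = Seq (Sym "title") (Star (Sym "para"))

main :: IO ()
main = do
  print (matches contentModel ["title", "para", "para"])  -- prints True
  print (matches contentModel ["para", "title"])          -- prints False
```

The approach needs no automaton construction up front; the expression is rewritten one child element at a time, which fits validation during a single pass over the children of an element.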

Feedback

We are interested in hearing your feedback on our Haskell XML Toolbox, suggestions for improvements, comments and criticisms.

Mail address is hxmltoolbox@fh-wedel.de

The Haskell XML Toolbox is distributed under the MIT License.
Last modified: 2009-10-29