Haskell XML Toolbox 7.3
Introduction
The Haskell XML Toolbox is a collection of tools for
processing XML with
Haskell.
It is itself purely written in Haskell.
The Haskell XML Toolbox is a project of the
University of Applied Sciences Wedel,
initiated by
Uwe Schmidt.
The core component of the Haskell XML Toolbox
is a validating XML parser that supports almost all of the
Extensible Markup Language (XML) 1.0 (Second Edition).
Description
The Haskell XML Toolbox is based on the ideas of
HaXml and
HXML,
but introduces a more general approach for processing XML with
Haskell.
The Haskell XML Toolbox uses a generic data model for
representing XML documents,
including the DTD subset and the document subset, in Haskell.
This data model makes it possible to use filter functions
as a uniform design of XML processing applications.
The whole XML parser including the validator parts
was implemented using this design.
Libraries with filters and combinators are provided for processing the generic data model.
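The filter idea can be sketched in a few lines. This is a minimal illustration and not the toolbox's own code: the `Tree` type is a hypothetical stand-in for the generic data model, and the names `children`, `hasName`, `isText`, `o` and `deep` are chosen here for illustration only.

```haskell
-- Toy tree standing in for the generic XML data model (hypothetical type).
data Tree = Elem String [Tree] | Text String
  deriving (Eq, Show)

-- A filter maps one node to a list of result nodes.
-- The empty list means "no result"; several results are allowed.
type Filter = Tree -> [Tree]

-- Select the direct children of an element.
children :: Filter
children (Elem _ cs) = cs
children _           = []

-- Keep the node if it is an element with the given name.
hasName :: String -> Filter
hasName n t@(Elem m _) | n == m = [t]
hasName _ _                     = []

-- Keep the node if it is a text node.
isText :: Filter
isText t@(Text _) = [t]
isText _          = []

-- Sequential composition: apply g first, then f to every result of g.
o :: Filter -> Filter -> Filter
f `o` g = concatMap f . g

-- Top-down search that stops descending as soon as the filter matches.
deep :: Filter -> Filter
deep f t = case f t of
  [] -> concatMap (deep f) (children t)
  rs -> rs

doc :: Tree
doc = Elem "html"
        [Elem "body" [Elem "p" [Text "hello"], Elem "p" [Text "world"]]]

-- All text nodes directly below a <p> element.
paragraphTexts :: [Tree]
paragraphTexts = (isText `o` children `o` deep (hasName "p")) doc

main :: IO ()
main = print paragraphTexts
```

Because every step has the same type, selection, testing and transformation steps can be combined freely, which is the uniformity the paragraph above refers to.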
A new, more flexible and more type-safe API based on arrows instead
of filters has been included since version 5.2.
There is a cookbook for using this arrow interface
to build (nontrivial) applications. Manuel Ohlendorf
has developed examples for processing RDF and has documented the
development in his master's thesis: A Cookbook for the Haskell
XML Toolbox with Examples for Processing RDF Documents
(the thesis as PDF).
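To give a rough feeling for the arrow style, here is a minimal sketch of a list arrow built only on the standard `Control.Arrow` and `Control.Category` classes. It is an illustration under assumptions, not the toolbox API: the `Tree` type is a hypothetical stand-in, and `LA`, `getChildren`, `hasName` and `getText` are defined locally for this example.

```haskell
import Prelude hiding (id, (.))
import Control.Category (Category (..))
import Control.Arrow (Arrow (..), (>>>))

-- A list arrow wraps a function from one input to a list of results.
newtype LA a b = LA { runLA :: a -> [b] }

instance Category LA where
  id          = LA (\x -> [x])
  -- Run f first, then g on every result of f.
  LA g . LA f = LA (\x -> concatMap g (f x))

instance Arrow LA where
  arr f        = LA (\x -> [f x])
  first (LA f) = LA (\(x, y) -> [ (x', y) | x' <- f x ])

-- Toy tree standing in for an XML document (hypothetical type).
data Tree = Elem String [Tree] | Text String
  deriving (Eq, Show)

getChildren :: LA Tree Tree
getChildren = LA cs
  where
    cs (Elem _ ts) = ts
    cs _           = []

hasName :: String -> LA Tree Tree
hasName n = LA p
  where
    p t@(Elem m _) | m == n = [t]
    p _                     = []

-- Unlike a tree-to-tree filter, this arrow has a different result type.
getText :: LA Tree String
getText = LA g
  where
    g (Text s) = [s]
    g _        = []

doc :: Tree
doc = Elem "list"
        [Elem "item" [Text "a"], Elem "other" [Text "x"], Elem "item" [Text "b"]]

-- Text content of all <item> children, with the length of each string.
itemTexts :: [(String, Int)]
itemTexts =
  runLA (getChildren >>> hasName "item" >>> getChildren >>> getText
           >>> arr (\s -> (s, length s))) doc

main :: IO ()
main = print itemTexts
```

In this sketch input and output types may differ from step to step, so the compiler rejects a pipeline that, for example, tries to take the children of a string.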
Features:
- Unicode and UTF-8, US-ASCII and ISO-Latin-1 support
- http: and file: protocol support
- http access via proxy
- well-formed document parsing, validation
- namespace support: namespace propagation and checking
- XPath support for selection of document parts
- liberal HTML parser for interpreting any text containing < ... > as HTML/XML
- Relax NG schema validator
- integrated XSLT transformer
- easy conversion between user defined data structures and XML
by the use of pickler functions
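The pickler idea from the last feature can be sketched as follows. This is a toy model that serialises to a list of string tokens instead of an XML tree. The combinator names echo those of pickler combinator libraries, but the types and definitions here are assumptions made for illustration and are not the toolbox's actual API.

```haskell
-- A pickler pairs a serialiser with the matching parser.
data PU a = PU
  { pickleWith   :: a -> [String] -> [String]        -- prepend tokens for a value
  , unpickleWith :: [String] -> Maybe (a, [String])  -- consume tokens, return the rest
  }

-- One token holding a string.
xpString :: PU String
xpString = PU (:) un
  where
    un (t : ts) = Just (t, ts)
    un []       = Nothing

-- One token holding a decimal integer.
xpInt :: PU Int
xpInt = PU (\n ts -> show n : ts) un
  where
    un (t : ts) = case reads t of
      [(n, "")] -> Just (n, ts)
      _         -> Nothing
    un [] = Nothing

-- Pickle two values one after the other.
xpPair :: PU a -> PU b -> PU (a, b)
xpPair pa pb = PU pk un
  where
    pk (a, b) = pickleWith pa a . pickleWith pb b
    un ts = do
      (a, ts')  <- unpickleWith pa ts
      (b, ts'') <- unpickleWith pb ts'
      return ((a, b), ts'')

-- Reuse a pickler for an isomorphic type.
xpWrap :: (a -> b, b -> a) -> PU a -> PU b
xpWrap (to, from) pa = PU (pickleWith pa . from) un
  where
    un ts = do
      (a, rest) <- unpickleWith pa ts
      return (to a, rest)

-- A user defined data type and its pickler (both hypothetical).
data Person = Person { name :: String, age :: Int }
  deriving (Eq, Show)

xpPerson :: PU Person
xpPerson = xpWrap (uncurry Person, \p -> (name p, age p)) (xpPair xpString xpInt)

pickle :: PU a -> a -> [String]
pickle pu x = pickleWith pu x []

-- Leftover tokens are ignored in this sketch.
unpickle :: PU a -> [String] -> Maybe a
unpickle pu ts = fst <$> unpickleWith pu ts

main :: IO ()
main = do
  let tokens = pickle xpPerson (Person "Ada" 36)
  print tokens
  print (unpickle xpPerson tokens)
```

Writing both directions in one value keeps the serialiser and the parser in step: changing `xpPerson` changes both at once.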
Documentation
A (somewhat) gentle introduction to HXT is available in the
Haskell Wiki.
The XSLT transformer has been developed by Tim Walkenhorst in
his master's thesis: Implementing an XSLT
processor for the Haskell XML Toolbox. It's a
rather complete implementation, but it's of course not a
substitute for Xalan or other advanced XSLT systems. The XSLT
module consists of less than 2000 lines of code. Compared with
the more than 300,000 lines of Java for Xalan, this Haskell code
can be viewed as one of the first formal specifications for XSLT.
Manuel Ohlendorf's master's thesis, describing the arrow interface
of the toolbox: A Cookbook for the Haskell
XML Toolbox with Examples for Processing RDF Documents
(the thesis as PDF).
The source code of the example application is included in the
doc/cookbook directory of the distribution.
The master's thesis
"Design and Implementation of a validating XML parser in
Haskell"
by Martin Schmidt describes the design and motivation of the
Haskell XML Toolbox
(the thesis as HTML
or PDF) and the development of the DTD
validator module.
The documentation in the thesis is a bit out of date: the modules,
module names and some function names have changed. For details, use the online
Haddock documentation.
The development of the XPath modules
is described (in German) in Konzeption und Implementierung
eines XPath-Moduls für die Haskell XML Toolbox
(PDF-document).
The internals of the Relax NG validator
modules are described (in German) in Design und Entwicklung
eines Relax NG Schema Validators auf Basis der Haskell XML
Toolbox (PDF-document).
The Filter API documentation is generated
with Haddock.
For the new arrow interface there is a more user-friendly Arrow API Documentation
including only the basic types and the arrow descriptions.
Requirements
- GHC-6.6
- GNU make (to build a compiled version) or Cabal
Downloads
Haskell XML Toolbox 7.3,
released 2007-09-10:
- This version works (at least) with ghc 6.6.
- Includes sources for building a ghc package
hxt
with Cabal or make.
This package contains a Haskell DOM, an XML parser, a
DTD validator, namespace processing functions, a Relax
NG validator, an XPath expression evaluator, an XSLT transformer,
and serialization/deserialization of data to/from XML.
- HTTP access can be done with the
Haskell HTTP module
or with the external program
curl.
The HTTP module must be installed before installing the
toolbox.
In the past, installing HTTP was a bit clumsy because of further
dependencies on The Haskell Cryptographic Library
and the NewBinary package. The latest HTTP module
from 2006-07-07 has removed these dependencies and consists of only one module, HTTP (not HTTP and Browser).
So only one package, HTTP, has to be installed before installing HXT.
- Includes various examples, e.g. in the example directory
examples/arrows/hparser/
a validating parser, which can be used as a starting point for
an HXT command line application.
- Provides a HUnit test
HUnitTest
for testing
and demonstration of the available set of filters.
- Includes an arrow interface with type classes and
overloading for a more flexible use of the filter technique.
A darcs repository is available under http://darcs.fh-wedel.de/hxt.
Installation
Before installing this version, install the Haskell HTTP module.
For a quick install with Cabal execute the following commands in
.../HXT-7.3
make setup
./setup configure --ghc
./setup build
./setup install # with root privileges
A quick test of the example programs:
cd examples
make all
make test
Installation without Cabal with GNU make:
make all
make install # with root privileges
Change history
- In Version 7.3
- Module Text.XML.HXT.Pickle for conversion between user
defined data types and the HXT DOM structure extended:
a DTD can now be derived
from a pickler and checked for consistency.
See Haddock doc (Text.XML.HXT.Pickle)
and example directory
examples/arrows/pickle for examples.
- In Version 7.2
- New module Text.XML.HXT.XmlPickler for conversion between user
defined data types and the HXT DOM structure. This enables
the simple persistent storage and retrieval of arbitrary
data with XML documents. See example directory
examples/arrows/pickle for a nontrivial example
of these picklers. These functions are an adaptation of
Andrew
Kennedy's pickler combinators.
- UTF-8 decoding done with the UTF8 module from
darcs.
Decoding errors are detected and reported.
US-ASCII decoding also checks for encoding errors.
- ISO-8859-X (X=2..11,13..16) decoding of input documents implemented.
- Some bug fixes.
- In Version 7.1
- Version control changed from CVS to darcs.
The darcs repository is available under
http://darcs.fh-wedel.de/hxt/.
- deepSeq and strict added for XmlTree
and an arrow strictA for forcing the evaluation of a whole
XML tree. This sometimes saves space when applied after document input,
DTD processing and validation.
- Typeable instances added for all DOM data types.
- HTTP access via curl extended to handle automatic redirects
- In Version 7.0
- New integrated XSLT transformer module. The example
parser (examples/arrows/hparser) is extended to act as an XSLT transformer.
- Errors concerning HTML parsing and implicit closing of
elements have been fixed.
- Some minor changes of and additions to the arrow API.
- New functions readString and
readFromString for reading documents from Haskell
strings the same way as reading external documents with readDocument.
- In Version 6.1
- In Version 6.0
- The arrow interface has changed slightly, especially the handling of user defined states in the state and IO arrows has been simplified.
This is the main reason for a 6.0 version.
- The XPath arrows have been extended.
There are arrows not only for selecting nodes via an XPath expression, but also
for processing and modifying all nodes selected
by an XPath expression (see module Text.XML.HXT.Arrow.XmlNodeSet).
- Separation of the API documentation into two documents, one for the
old Filter API
and a separate one for the Arrow API.
The complete API documentation is still available.
- DTD processing for the arrow part is done completely by arrow based routines.
- Cabal config file and dependencies changed to work with ghc 6.4.2.
- In Version 5.5
- Changes in file path and default base URI handling.
This release compiles and runs (at least) under Windows XP,
Cygwin (DLL 1.5.19-4 release)
and ghc 6.4.1.
- In Version 5.4
- Bug fix in parameter entity substitution for external
parameter entities.
- In Version 5.3
- Documentation and usage examples for the new arrow interface
of the toolbox: A Cookbook for the Haskell
XML Toolbox with Examples for Processing RDF Documents
(PDF-document).
- Integration of Relax NG schema validator into the
toolbox.
A usage example is included in the
examples/arrows/hparser/HXmlParser.hs source.
- Some data structure changes made for runtime and space
optimization.
- Parameter entity substitution reworked because of bugs in
nested and recursive parameter entity substitution.
- In Version 5.2
- Bug fix with entity substitution of nested
external parameter entities in DTDs.
- Some minor changes in the arrow interface.
- In Version 5.01
- Sources ported to work with ghc-6.4, which is now required.
Please use toolbox versions <= 5.00 for older ghc versions.
- Native HTTP-modules removed from the distribution.
Install the latest version of the Haskell HTTP module
before installing Version 5.01
- Interface for the HTTP access via curl done with
routines from System.Process, so the deprecated module POSIX
is no longer needed.
- Build and install can be done with Cabal. See README file for installation.
- In Version 5.00
- All compiled modules are packaged into
a single ghc package, hxt.
- A new interface for arrows is added.
Modules in this interface are found in
Control.Arrow
and Text.XML.HXT.Arrow
(see Haddock documentation)
- Examples for the usage of the arrow interface are found
in the examples/arrows subdirectory.
Known problems and limitations
The parser has been tested with the XML Validation Suite from the
W3C. The following problems have been encountered:
- Line numbers in XML parser do not always point to the
correct position of the syntax error.
- Line numbers are not yet reported for validation constraint
errors.
- The standalone document check is not yet implemented.
- The XSLT module does not support the complete XSLT standard.
Portability
Portability to Windows based systems has not been tested very
intensively, but did work on an XP system with the Cygwin tools installed.
Development was done under Linux with GHC 6.4 with the -Wall
flag. No warnings were issued when compiling the toolbox sources.
Haskell Modules and Libraries used in the toolbox
Related work
- Malcolm Wallace and Colin Runciman wrote
HaXml,
a collection of utilities for using Haskell and XML together.
The Haskell XML Toolbox is based on their idea
of using filter combinators for processing XML with Haskell.
- Joe English wrote
HXML
- a non-validating XML parser in Haskell.
His
idea
of validating XML by using derivatives of regular expressions
was
implemented in the validation functions of this software.
Also his ideas and sources for navigable trees are used
in the hxpath modules.
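The technique of validating by derivatives of regular expressions can be sketched as follows. This is a minimal, unoptimised illustration of the general idea (Brzozowski derivatives applied to a content model over element names). It is not taken from the toolbox sources, and all names in it are chosen for this example.

```haskell
-- A content model as a regular expression over element names.
data RE
  = Empty        -- matches nothing
  | Eps          -- matches the empty sequence
  | Sym String   -- exactly one element with this name
  | Seq RE RE    -- first one model, then the other
  | Alt RE RE    -- either model
  | Star RE      -- zero or more repetitions
  deriving (Eq, Show)

-- Does the expression accept the empty sequence?
nullable :: RE -> Bool
nullable Empty     = False
nullable Eps       = True
nullable (Sym _)   = False
nullable (Seq a b) = nullable a && nullable b
nullable (Alt a b) = nullable a || nullable b
nullable (Star _)  = True

-- The derivative with respect to x describes what may still follow
-- after an element named x has been read.
deriv :: String -> RE -> RE
deriv _ Empty   = Empty
deriv _ Eps     = Empty
deriv x (Sym y) = if x == y then Eps else Empty
deriv x (Seq a b)
  | nullable a = Alt (Seq (deriv x a) b) (deriv x b)
  | otherwise  = Seq (deriv x a) b
deriv x (Alt a b) = Alt (deriv x a) (deriv x b)
deriv x (Star a)  = Seq (deriv x a) (Star a)

-- A child sequence is valid if the expression left over after
-- consuming all names accepts the empty sequence.
matches :: RE -> [String] -> Bool
matches re = nullable . foldl (flip deriv) re

-- Content model corresponding to (title, para*).
section :: RE
section = Seq (Sym "title") (Star (Sym "para"))

main :: IO ()
main = do
  print (matches section ["title", "para", "para"])
  print (matches section ["para"])
```

No automaton has to be constructed up front: the expression is rewritten one element at a time. A practical implementation would also simplify the intermediate expressions, which this sketch omits.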
Feedback
We are interested in hearing your feedback
on our Haskell XML Toolbox, suggestions
for improvements, comments and criticisms.
Mail address is
hxmltoolbox@fh-wedel.de
The Haskell XML Toolbox
is distributed under the
MIT License.
Last modified: 2007-09-10