HXQ: A Compiler from XQuery to Haskell

Description

HXQ is a fast and space-efficient translator from XQuery (the standard query language for XML) to embedded Haskell code. The translation is based on Haskell templates. HXQ takes full advantage of Haskell's lazy evaluation to keep in memory only those parts of XML data needed at each point of evaluation, thus performing stream-based evaluation for forward queries (queries that do not contain backward steps). This results to an implementation that is as fast and space-efficient as any stream-based implementation based on SAX filters or finite state machines. Furthermore, the coding is far simpler and extensible since its based on XML trees, rather than SAX events.

For example, the XQuery given at the bottom of this page, which is against the DBLP XML database (420MB), runs in 39 seconds on my laptop PC (using 18MB of max heap space). To contrast this, Qexo, which compiles XQueries to Java bytecode, took 1 minute 17 seconds (using no less than 1400MB of heap space). Also XQilla, which is written in C++, took 1 minute and 10 secs (using 1150MB of heap space). (All results are taken on an Intel Core 2 Duo 2.2GHz 2GB running ghc-6.8.2 on Linux 2.6.23 kernel.)

Finally, HXQ can store XML documents in a relational database (currently SQLite), by shredding XML into relational tuples, and by translating XQueries over the shredded documents into optimized SQL queries.

Installation Instructions

First, you need to install the Glasgow Haskell Compiler, ghc, the parser generator for Haskell, happy, and SQLite for database connectivity (optional). For example, in Fedora Linux, you install them using:

yum install ghc happy sqlite
Then, you need to install the haskell packages: HDBC and the HDBC-sqlite3 driver to connect to SQLite relational databases.

Finally, download HXQ and untar it. You can use either make or cabal to build it. To build it with cabal, you do:

runhaskell Setup.lhs configure --prefix=$HOME
runhaskell Setup.lhs build
runhaskell Setup.lhs install
(The last command must be run as root.) It will create the executable xquery, which is the XQuery interpreter, and the HXQ library. To use the library, you run ghc or ghci with -package HXQ.

Current Status

HXQ supports most essential XQuery features, although some system functions are missing (but are easy to add). To see the list of supported system functions, run xquery -help . HXQ does not have static typechecking; it leaves all checking to Haskell. This means that it distinguishes regular predicates from indexing at run time: if an XPath predicate returns an integer at run time, it is taken as indexing and this index is checked against the current node position. The most important omission is backward step axes, such as /.. (parent). Some, but not all, parent axis steps are removed using optimization rules; all others cause a compilation error. Finally, the XQuery semantics requires duplicate elimination and sorting by document order for every XPath step, which is very expensive and unnecessary in most cases. For example, e//*//* may return duplicate elements in HXQ. This will be addressed in the future (needs a static analysis to determine when duplicate elimination is necessary).

HXQ uses the HXML parser for XML (developed by Joe English), which is included in the source. I have also tried hexpat, tagsoup, HXT, and HaXML Xtract, but they all have space leaks.

Using the Compiler

The main functions for embedding XQueries in Haskell are:

where query is a string value (a Haskell expression that evaluates to a string at compile-time). They both translate the query into Haskell code, which is compiled and optimized into machine code directly. The code that xe generates has type XSeq (a sequence of XML trees of type [XTree]) while the code that xq generates has type (IO XSeq). If the query reads at least one document (using doc(...)), then you should use xq since it requires IO. To define constant XML data or a function body, it is better to use xe. You can use the value of a Haskell variable v inside a query using $v as long as v has type XSeq. To use a function in a query, it should be defined in Haskell with type (XSeq,...,XSeq) -> XSeq.

Here is an example of a main program:

f(x,y) = $(xe "<article><first>{$x}</first><second>{$y}</second></article>")

main = do a <- $(xq ("<result>{                                                        "
                 ++"     for $x at $i in doc('data/dblp.xml')//inproceedings           "
                 ++"     where $x/author = 'Leonidas Fegaras'                          "
                 ++"     order by $x/year descending                                   "
                 ++"     return <paper>{ $i, ') ', $x/booktitle/text(),                "
                 ++"                     ': ', $x/title/text()                         "
                 ++"            }</paper>                                              "
                 ++"  }</result>                                                       "))
          putXSeq a
          b <- $(xq " f( $a/paper[10], $a/paper[8] ) ")
          putXSeq b
Another example, can be found in Test1.hs.

You can compile an XQuery file into a Haskell program (Temp.hs) using xquery -c file. Or better, you can use the Unix shell script compile to compile the XQuery file to an executable. For example:

compile data/q1.xq
will compile the XQuery file data/q1.xq into a.out.

Using the Interpreter

The HXQ interpreter is far more slower than the compiler; use it only if you need to evaluate ad-hoc XQueries read from input or from files. The main functions are:

The HXQ interpreter doesn't recognize Haskell variables and functions (but you may declare XQuery variables and functions using the XQuery 'declare' syntax). The main HXQ program, called xquery, evaluates an XQuery in a file using the interpreter. For example:
xquery data/q1.xq
Without an argument, it reads and evaluates XQueries and variable/function declarations from input. With xquery -p xpath-query xml-file you evaluate an XPath query against an XML file, eg. xquery -p "//inproceedings[100]" data/dblp.xml. With xquery -help you get the list of system functions and usage information.

Database Connectivity

HXQ provides an interface to HDBC to query relational data inside an XQuery. For the HXQ compiler, the main function that allows database connectivity is:

$(xqdb query) :: (IConnection conn) => conn -> IO XSeq
For example, if the database name is "myDB", then
do db <- connect "myDB"
   result <- $(xqdb xquery) db
For the HXQ interpreter, the function is:
xqueryDB :: (IConnection conn) => String -> conn -> IO XSeq
The xquery executable can also run XQueries that use a database by specifying the database name using the -db option.

Currently, HXQ works with SQLite only, but is very easy to make it work with any relational database that supports ODBC: simply install HDBC-odbc and change the file XML/HXQ/DBConnect.hs accordingly.

Querying an Existing Database

An XQuery may contain multiple SQL queries in the form sql(query,args), where query is the sql query that may contain parameters (denoted by ?), which are bound to the values in args (an XSeq). An example can be found in TestDB.hs. To run this example, you need to install the company database. For example, using the sqlite3 interpreter, you do:

sqlite3 myDB
.read data/company.sql
.quit
and then compile and run TestDB.hs (or just do make test3 on linux).

Shredding

To store an XML document into a relational database, use the following Haskell function:

shred :: (IConnection conn) => conn -> String -> String -> IO ()
shred db file name
that shreds and stores the XML document located at the file pathname in the database db under a unique name. HXQ will find a good relational schema (using hybrid inlining) to store the XML data by first scanning the document to extract its structural summary, then deriving a good relational schema, and finally scanning the document for a second time to store its data into the relational tables. For example,
do db <- connect "myDB"
   shred db "data/cs.xml" "c"

The haskell function

createIndex db name tagname
creates a secondary index on tagname for the shredded document under the given name.

Publishing

You can query a shredded XML document using the XQuery function:

publish(dbame,name)
where dbname is the database file name and name is the unique name assigned to the XML document when was shredded. The translation from XQuery to SQL is done at compile-time, so both dbname and name must be constant strings. HXQ will do its best to push relevant predicates to the generated SQL query (using partial evaluation and code folding), thus deriving an efficient execution. One example is TestDB2.hs (do make test4 to run it on linux).

Experimental Features

There is an experimental program, called hxqc, that works like xquery but is faster because it uses the HXQ compiler instead of the interpreter. It's constructed with the Unix Makefile (make hxqc). It is unstable and the hxqc executable is 10 times larger than xquery since it uses the ghc libraries at run-time. Also, although it is faster than the interpreter, it is slower than a compiled xquery because it uses part of the heap for the ghc compiler to compile Haskell code on-the-fly.


Last modified: 06/27/08 by Leonidas Fegaras