xml-enumerator-0.0.0.1: Pure-Haskell utilities for dealing with XML with the enumerator package.

Text.XML.Enumerator.Parse

Contents

Description

This module provides both a native Haskell solution for parsing XML documents into a stream of events, and a set of parser combinators for dealing with a stream of events.

The important thing to know about the combinators is that they do not work on the fully-powered Event datatype; rather, this module defines an SEvent datatype which only deals with tags, attributes and content. For most uses, this is sufficient. If you need to parse doctypes, instructions or contents, you will not be able to use the combinators.

As a simple example, if you have the following XML file:

 <?xml version="1.0" encoding="utf-8"?>
 <people>
     <person age="25">Michael</person>
     <person age="2">Eliezer</person>
 </people>

Then this code:

 {-# LANGUAGE OverloadedStrings #-}
 import Text.XML.Enumerator.Parse
 import Data.Text.Lazy (Text, unpack)
 
 data Person = Person { age :: Int, name :: Text }
     deriving Show
 
 parsePerson = tag' "person" (requireAttr "age") $ \age -> do
     name <- content'
     return $ Person (read $ unpack age) name
 
 parsePeople = tag'' "people" $ many parsePerson
 
 main = parseFile_ "people.xml" (const Nothing) $ force "people required" parsePeople

will produce:

 [Person {age = 25, name = "Michael"},Person {age = 2, name = "Eliezer"}]

Synopsis

Parsing XML files

parseBytes :: Monad m => Enumeratee ByteString Event m aSource

Parses a UTF8-encoded byte stream into Events. This function is implemented fully in Haskell using attoparsec for parsing. The produced error messages do not give line/column information, so you may prefer to stick with the parser provided by libxml-enumerator. However, this has the advantage of not relying on any C libraries.

If you are uncertain of the character encoding, use the detectUtf enumeratee.

detectUtf :: Monad m => Enumeratee ByteString ByteString m aSource

Automatically determine which UTF variant is being used. This function first checks for BOMs, removing them as necessary, and then check for the equivalent of <?xml for each of UTF-8, UTF-16LEBE, and UTF-32LEBE. It defaults to assuming UTF-8. The output byte stream is guaranteed to be valid UTF-8 bytes.

Simplified events

data SEvent Source

A greatly simplified XML event datatype. The best way to produce these values is the simplify enumeratee.

Instances

simplify :: Monad m => (Text -> Maybe Text) -> Enumeratee Event SEvent m bSource

Convert a stream of Events into a stream SEvents. The first argument is a function to decode character entity references. Some things to note about this function:

  • It drops events for document begin/end, comments, and instructions.
  • It concatenates all pieces of content together. The output of this function is guaranteed to not have two consecutive SContents.
  • It automatically checks that tag beginnings and endings are well balanced, and throws an exception otherwise.
  • It also throws an exception if your supplied entity function does not know how to deal with a character entity.

Please also note that you do not need to handle the 5 XML-defined character entity references (lt, gt, amp, quot and apos), nor deal with numeric entities (decimal and hex).

type SAttr = (Name, Text)Source

A simplified attribute, having all entities converted to text.

parseFile :: String -> (Text -> Maybe Text) -> Iteratee SEvent IO a -> IO (Either SomeException a)Source

A helper function which reads a file from disk using enumFile, detects character encoding using detectUtf, parses the XML using parseBytes, converts to an SEvent stream using simplify and then handing off control to your supplied parser.

parseFile_ :: String -> (Text -> Maybe Text) -> Iteratee SEvent IO a -> IO aSource

The same as parseFile, but throws any exceptions.

SEvent parsing

tag :: Monad m => (Name -> Maybe a) -> (a -> AttrParser b) -> (b -> Iteratee SEvent m c) -> Iteratee SEvent m (Maybe c)Source

The most generic way to parse a tag. It takes a predicate for checking if this is the correct tag name, an AttrParser for handling attributes, and then a parser for dealing with content.

This function automatically absorbs its balancing closing tag, and will throw an exception if not all of the attributes or child elements are consumed. If you want to allow extra attributes, see ignoreAttrs.

tag' :: Monad m => Name -> AttrParser a -> (a -> Iteratee SEvent m b) -> Iteratee SEvent m (Maybe b)Source

A simplified version of tag which matches for specific tag names instead of taking a predicate function. This is often sufficient, and when combined with OverloadedStrings and the IsString instance of Name, can prove to be very concise.

tag'' :: Monad m => Name -> Iteratee SEvent m a -> Iteratee SEvent m (Maybe a)Source

A further simplified tag parser, which requires that no attributes exist.

content :: Monad m => Iteratee SEvent m (Maybe Text)Source

Grabs the next piece of content if available.

content' :: Monad m => Iteratee SEvent m TextSource

Grabs the next piece of content. If none if available, returns empty.

Attribute parsing

data AttrParser a Source

A monad for parsing attributes. By default, it requires you to deal with all attributes present on an element, and will throw an exception if there are unhandled attributes. Use the requireAttr, optionalAttr et al functions for handling an attribute, and ignoreAttrs if you would like to skip the rest of the attributes on an element.

requireAttr :: Name -> AttrParser TextSource

Require that a certain attribute be present and return its value.

optionalAttr :: Name -> AttrParser (Maybe Text)Source

Return the value for an attribute if present.

ignoreAttrs :: AttrParser ()Source

Skip the remaining attributes on an element. Since this will clear the list of attributes, you must call this after any calls to requireAttr, optionalAttr, etc.

Combinators

choose :: Monad m => [Iteratee SEvent m (Maybe a)] -> Iteratee SEvent m (Maybe a)Source

Get the value of the first parser which returns Just. If none return Just, returns Nothing.

many :: Monad m => Iteratee SEvent m (Maybe a) -> Iteratee SEvent m [a]Source

Keep parsing elements as long as the parser returns Just.

forceSource

Arguments

:: Monad m 
=> String

Error message

-> Iteratee SEvent m (Maybe a) 
-> Iteratee SEvent m a 

Force an optional parser into a required parser. All of the tag functions, choose and many deal with Maybe parsers. Use this when you want to finally force something to happen.

Exceptions