xml-conduit-1.2.0.1: Pure-Haskell utilities for dealing with XML with the conduit package.

Safe HaskellNone

Text.XML.Stream.Parse

Contents

Description

This module provides both a native Haskell solution for parsing XML documents into a stream of events, and a set of parser combinators for dealing with a stream of events.

As a simple example, if you have the following XML file:

 <?xml version="1.0" encoding="utf-8"?>
 <people>
     <person age="25">Michael</person>
     <person age="2">Eliezer</person>
 </people>

Then this code:

 {-# LANGUAGE OverloadedStrings #-}
 import Control.Monad.Trans.Resource
 import Data.Conduit (($$))
 import Data.Text (Text, unpack)
 import Text.XML.Stream.Parse

 data Person = Person Int Text
     deriving Show

 parsePerson = tagName "person" (requireAttr "age") $ \age -> do
     name <- content
     return $ Person (read $ unpack age) name

 parsePeople = tagNoAttr "people" $ many parsePerson

 main = do
     people <- runResourceT $
             parseFile def "people.xml" $$ force "people required" parsePeople
     print people

will produce:

 [Person {age = 25, name = "Michael"},Person {age = 2, name = "Eliezer"}]

Previous versions of this module contained a number of more sophisticated functions written by Aristid Breitkreuz and Dmitry Olshansky. To keep this package simpler, those functions are being moved to a separate package. This note will be updated with the name of the package(s) when available.

Synopsis

Parsing XML files

parseBytes :: MonadThrow m => ParseSettings -> Conduit ByteString m EventSource

Parses a byte stream into Events. This function is implemented fully in Haskell using attoparsec-text for parsing. The produced error messages do not give line/column information, so you may prefer to stick with the parser provided by libxml-enumerator. However, this has the advantage of not relying on any C libraries.

This relies on detectUtf to determine character encoding, and parseText to do the actual parsing.

parseText :: MonadThrow m => ParseSettings -> Conduit Text m EventPosSource

Parses a character stream into Events. This function is implemented fully in Haskell using attoparsec-text for parsing. The produced error messages do not give line/column information, so you may prefer to stick with the parser provided by libxml-enumerator. However, this has the advantage of not relying on any C libraries.

detectUtf :: MonadThrow m => Conduit ByteString m TextSource

Automatically determine which UTF variant is being used. This function first checks for BOMs, removing them as necessary, and then check for the equivalent of <?xml for each of UTF-8, UTF-16LEBE, and UTF-32LEBE. It defaults to assuming UTF-8.

parseFile :: MonadResource m => ParseSettings -> FilePath -> Producer m EventSource

A helper function which reads a file from disk using enumFile, detects character encoding using detectUtf, parses the XML using parseBytes, and then hands off control to your supplied parser.

parseLBS :: MonadThrow m => ParseSettings -> ByteString -> Producer m EventSource

Parse an event stream from a lazy ByteString.

Parser settings

def :: Default a => a

The default value for this type.

Entity decoding

decodeXmlEntities :: DecodeEntitiesSource

Default implementation of DecodeEntities: handles numeric entities and the five standard character entities (lt, gt, amp, quot, apos).

decodeHtmlEntities :: DecodeEntitiesSource

HTML4-compliant entity decoder. Handles numerics, the five standard character entities, and the additional 248 entities defined by HTML 4 and XHTML 1.

Note that HTML 5 introduces a drastically larger number of entities, and this code does not recognize most of them.

Event parsing

tag :: MonadThrow m => (Name -> Maybe a) -> (a -> AttrParser b) -> (b -> ConduitM Event o m c) -> ConduitM Event o m (Maybe c)Source

The most generic way to parse a tag. It takes a predicate for checking if this is the correct tag name, an AttrParser for handling attributes, and then a parser for dealing with content.

This function automatically absorbs its balancing closing tag, and will throw an exception if not all of the attributes or child elements are consumed. If you want to allow extra attributes, see ignoreAttrs.

This function automatically ignores comments, instructions and whitespace.

tagPredicate :: MonadThrow m => (Name -> Bool) -> AttrParser a -> (a -> ConduitM Event o m b) -> ConduitM Event o m (Maybe b)Source

A simplified version of tag which matches against boolean predicates.

tagName :: MonadThrow m => Name -> AttrParser a -> (a -> ConduitM Event o m b) -> ConduitM Event o m (Maybe b)Source

A simplified version of tag which matches for specific tag names instead of taking a predicate function. This is often sufficient, and when combined with OverloadedStrings and the IsString instance of Name, can prove to be very concise.

tagNoAttr :: MonadThrow m => Name -> ConduitM Event o m a -> ConduitM Event o m (Maybe a)Source

A further simplified tag parser, which requires that no attributes exist.

content :: MonadThrow m => Consumer Event m TextSource

Grabs the next piece of content. If none if available, returns empty. This is simply a wrapper around contentMaybe.

contentMaybe :: MonadThrow m => Consumer Event m (Maybe Text)Source

Grabs the next piece of content if available. This function skips over any comments and instructions and concatenates all content until the next start or end tag.

Attribute parsing

data AttrParser a Source

A monad for parsing attributes. By default, it requires you to deal with all attributes present on an element, and will throw an exception if there are unhandled attributes. Use the requireAttr, optionalAttr et al functions for handling an attribute, and ignoreAttrs if you would like to skip the rest of the attributes on an element.

Alternative instance behave like First monoid. It chooses first parser which doesn't fail.

requireAttr :: Name -> AttrParser TextSource

Require that a certain attribute be present and return its value.

optionalAttr :: Name -> AttrParser (Maybe Text)Source

Return the value for an attribute if present.

ignoreAttrs :: AttrParser ()Source

Skip the remaining attributes on an element. Since this will clear the list of attributes, you must call this after any calls to requireAttr, optionalAttr, etc.

Combinators

orE :: Monad m => Consumer Event m (Maybe a) -> Consumer Event m (Maybe a) -> Consumer Event m (Maybe a)Source

Get the value of the first parser which returns Just. If no parsers succeed (i.e., return Just), this function returns Nothing.

 orE a b = choose [a, b]

choose :: Monad m => [Consumer Event m (Maybe a)] -> Consumer Event m (Maybe a)Source

Get the value of the first parser which returns Just. If no parsers succeed (i.e., return Just), this function returns Nothing.

many :: Monad m => Consumer Event m (Maybe a) -> Consumer Event m [a]Source

Keep parsing elements as long as the parser returns Just.

forceSource

Arguments

:: MonadThrow m 
=> String

Error message

-> ConduitM Event o m (Maybe a) 
-> ConduitM Event o m a 

Force an optional parser into a required parser. All of the tag functions, choose and many deal with Maybe parsers. Use this when you want to finally force something to happen.

Exceptions

Other types