xml-conduit-1.4.0.1: Pure-Haskell utilities for dealing with XML with the conduit package.

Safe HaskellNone
LanguageHaskell98

Text.XML.Stream.Parse

Contents

Description

This module provides both a native Haskell solution for parsing XML documents into a stream of events, and a set of parser combinators for dealing with a stream of events.

As a simple example, if you have the following XML file:

<?xml version="1.0" encoding="utf-8"?>
<people>
    <person age="25">Michael</person>
    <person age="2">Eliezer</person>
</people>

Then this code:

{-# LANGUAGE OverloadedStrings #-}
import Control.Monad.Trans.Resource
import Data.Conduit (($$))
import Data.Text (Text, unpack)
import Text.XML.Stream.Parse

data Person = Person Int Text
    deriving Show

parsePerson = tagName "person" (requireAttr "age") $ \age -> do
    name <- content
    return $ Person (read $ unpack age) name

parsePeople = tagNoAttr "people" $ many parsePerson

main = do
    people <- runResourceT $
            parseFile def "people.xml" $$ force "people required" parsePeople
    print people

will produce:

[Person {age = 25, name = "Michael"},Person {age = 2, name = "Eliezer"}]

This module also supports streaming results using yield. This allows parser results to be processed using conduits while a particular parser (e.g. many) is still running. Without using streaming results, you have to wait until the parser finished before you can process the result list. Large XML files might be easier to process by using streaming results. See http://stackoverflow.com/q/21367423/2597135 for a related discussion.

{-# LANGUAGE OverloadedStrings #-}
import Control.Monad.Trans.Resource
import Data.Conduit
import Data.Text (Text, unpack)
import Text.XML.Stream.Parse
import Text.XML (Name)
import Control.Monad.Trans.Class (lift)
import Control.Monad (void)
import qualified Data.Conduit.List as CL

data Person = Person Int Text deriving Show

parsePerson = tagName "person" (requireAttr "age") $ \age -> do
    name <- content
    return $ Person (read $ unpack age) name

parsePeople = void $ tagNoAttr "people" $ manyYield parsePerson

main = runResourceT $
    parseFile def "people.xml" $$ parsePeople =$ CL.mapM_ (lift . print)

Previous versions of this module contained a number of more sophisticated functions written by Aristid Breitkreuz and Dmitry Olshansky. To keep this package simpler, those functions are being moved to a separate package. This note will be updated with the name of the package(s) when available.

Synopsis

Parsing XML files

parseBytes :: MonadThrow m => ParseSettings -> Conduit ByteString m Event Source #

Parses a byte stream into Events. This function is implemented fully in Haskell using attoparsec-text for parsing. The produced error messages do not give line/column information, so you may prefer to stick with the parser provided by libxml-enumerator. However, this has the advantage of not relying on any C libraries.

This relies on detectUtf to determine character encoding, and parseText' to do the actual parsing.

parseText' :: MonadThrow m => ParseSettings -> Conduit Text m Event Source #

Parses a character stream into Events. This function is implemented fully in Haskell using attoparsec-text for parsing. The produced error messages do not give line/column information, so you may prefer to stick with the parser provided by libxml-enumerator. However, this has the advantage of not relying on any C libraries.

Since 1.2.4

parseTextPos :: MonadThrow m => ParseSettings -> Conduit Text m EventPos Source #

Same as parseText', but includes the position of each event.

Since 1.2.4

detectUtf :: MonadThrow m => Conduit ByteString m Text Source #

Automatically determine which UTF variant is being used. This function first checks for BOMs, removing them as necessary, and then check for the equivalent of <?xml for each of UTF-8, UTF-16LEBE, and UTF-32LEBE. It defaults to assuming UTF-8.

parseFile :: MonadResource m => ParseSettings -> FilePath -> Producer m Event Source #

A helper function which reads a file from disk using enumFile, detects character encoding using detectUtf, parses the XML using parseBytes, and then hands off control to your supplied parser.

parseLBS :: MonadThrow m => ParseSettings -> ByteString -> Producer m Event Source #

Parse an event stream from a lazy ByteString.

Parser settings

def :: Default a => a #

The default value for this type.

psRetainNamespaces :: ParseSettings -> Bool Source #

Whether the original xmlns attributes should be retained in the parsed values. For more information on motivation, see:

https://github.com/snoyberg/xml/issues/38

Default: False

Since 1.2.1

Entity decoding

decodeXmlEntities :: DecodeEntities Source #

Default implementation of DecodeEntities: handles numeric entities and the five standard character entities (lt, gt, amp, quot, apos).

decodeHtmlEntities :: DecodeEntities Source #

HTML4-compliant entity decoder. Handles numerics, the five standard character entities, and the additional 248 entities defined by HTML 4 and XHTML 1.

Note that HTML 5 introduces a drastically larger number of entities, and this code does not recognize most of them.

Event parsing

tag Source #

Arguments

:: MonadThrow m 
=> (Name -> Maybe a)

Check if this is a correct tag name and return a value that can be used to get an AttrParser. If this returns Nothing, the function will also return Nothing

-> (a -> AttrParser b)

Given the value returned by the name checker, this function will be used to get an AttrParser appropriate for the specific tag.

-> (b -> ConduitM Event o m c)

Handler function to handle the attributes and children of a tag, given the value return from the AttrParser

-> ConduitM Event o m (Maybe c) 

The most generic way to parse a tag. It takes a predicate for checking if this is the correct tag name, an AttrParser for handling attributes, and then a parser for dealing with content.

Events are consumed if and only if the predicate holds.

This function automatically absorbs its balancing closing tag, and will throw an exception if not all of the attributes or child elements are consumed. If you want to allow extra attributes, see ignoreAttrs.

This function automatically ignores comments, instructions and whitespace.

tagPredicate Source #

Arguments

:: MonadThrow m 
=> (Name -> Bool)

Name predicate that returns True if the name matches the parser

-> AttrParser a

The attribute parser to be used for tags matching the predicate

-> (a -> ConduitM Event o m b)

Handler function to handle the attributes and children of a tag, given the value return from the AttrParser

-> ConduitM Event o m (Maybe b) 

A simplified version of tag which matches against boolean predicates.

tagName Source #

Arguments

:: MonadThrow m 
=> Name

The tag name this parser matches to (includes namespaces)

-> AttrParser a

The attribute parser to be used for tags matching the predicate

-> (a -> ConduitM Event o m b)

Handler function to handle the attributes and children of a tag, given the value return from the AttrParser

-> ConduitM Event o m (Maybe b) 

A simplified version of tag which matches for specific tag names instead of taking a predicate function. This is often sufficient, and when combined with OverloadedStrings and the IsString instance of Name, can prove to be very concise. . Note that Name is namespace sensitive. When using the IsString instance of name, use > "{http:/ab}c" :: Name to match the tag c in the XML namespace http://a/b

tagNoAttr Source #

Arguments

:: MonadThrow m 
=> Name

The name this parser matches to

-> ConduitM Event o m a

Handler function to handle the children of the matched tag

-> ConduitM Event o m (Maybe a) 

A further simplified tag parser, which requires that no attributes exist.

tagIgnoreAttrs Source #

Arguments

:: MonadThrow m 
=> Name

The name this parser matches to

-> ConduitM Event o m a

Handler function to handle the children of the matched tag

-> ConduitM Event o m (Maybe a) 

A further simplified tag parser, which ignores all attributes, if any exist

tagPredicateIgnoreAttrs Source #

Arguments

:: MonadThrow m 
=> (Name -> Bool)

The name predicate this parser matches to

-> ConduitM Event o m a

Handler function to handle the children of the matched tag

-> ConduitM Event o m (Maybe a) 

A further simplified tag parser, which ignores all attributes, if any exist

content :: MonadThrow m => Consumer Event m Text Source #

Grabs the next piece of content. If none if available, returns empty. This is simply a wrapper around contentMaybe.

contentMaybe :: MonadThrow m => Consumer Event m (Maybe Text) Source #

Grabs the next piece of content if available. This function skips over any comments and instructions and concatenates all content until the next start or end tag.

Ignoring tags/trees

ignoreTag Source #

Arguments

:: MonadThrow m 
=> (Name -> Bool)

The predicate name to match to

-> ConduitM Event o m (Maybe ()) 

Ignore an empty tag and all of its attributes by predicate. This does not ignore the tag recursively (i.e. it assumes there are no child elements). This functions returns Just if the tag matched.

ignoreTagName Source #

Arguments

:: MonadThrow m 
=> Name

The name to match to

-> ConduitM Event o m (Maybe ()) 

Like ignoreTag, but matches an exact name

ignoreAnyTagName Source #

Arguments

:: MonadThrow m 
=> [Name]

The name to match to

-> ConduitM Event o m (Maybe ()) 

Like ignoreTagName, but matches any name from a list of names.

ignoreAllTags :: MonadThrow m => ConduitM Event o m (Maybe ()) Source #

Like ignoreTag, but matches all tag name.

ignoreAllTags = ignoreTag (const True)

ignoreTree Source #

Arguments

:: MonadThrow m 
=> (Name -> Bool)

The predicate name to match to

-> ConduitM Event o m (Maybe ()) 

Ignore an empty tag, its attributes and its children subtree recursively. Both content and text events are ignored. This functions returns Just if the tag matched.

ignoreTreeName :: MonadThrow m => Name -> ConduitM Event o m (Maybe ()) Source #

Like ignoreTagName, but also ignores non-empty tabs

ignoreAnyTreeName Source #

Arguments

:: MonadThrow m 
=> [Name]

The name to match to

-> ConduitM Event o m (Maybe ()) 

Like ignoreTagName, but matches any name from a list of names.

ignoreAllTrees :: MonadThrow m => ConduitM Event o m (Maybe ()) Source #

Like ignoreAllTags, but ignores entire subtrees.

ignoreAllTrees = ignoreTree (const True)

ignoreAllTreesContent :: MonadThrow m => ConduitM Event o m (Maybe ()) Source #

Like ignoreAllTrees, but also ignores all content events

Attribute parsing

data AttrParser a Source #

A monad for parsing attributes. By default, it requires you to deal with all attributes present on an element, and will throw an exception if there are unhandled attributes. Use the requireAttr, attr et al functions for handling an attribute, and ignoreAttrs if you would like to skip the rest of the attributes on an element.

Alternative instance behaves like First monoid: it chooses first parser which doesn't fail.

attr :: Name -> AttrParser (Maybe Text) Source #

Return the value for an attribute if present.

requireAttr :: Name -> AttrParser Text Source #

Shortcut composition of force and attr.

optionalAttr :: Name -> AttrParser (Maybe Text) Source #

Deprecated: Please use attr.

ignoreAttrs :: AttrParser () Source #

Skip the remaining attributes on an element. Since this will clear the list of attributes, you must call this after any calls to requireAttr, optionalAttr, etc.

Combinators

orE Source #

Arguments

:: Monad m 
=> Consumer Event m (Maybe a)

The first (preferred) parser

-> Consumer Event m (Maybe a)

The second parser, only executed if the first parser fails

-> Consumer Event m (Maybe a) 

Get the value of the first parser which returns Just. If no parsers succeed (i.e., return Just), this function returns Nothing.

orE a b = choose [a, b]

choose Source #

Arguments

:: Monad m 
=> [ConduitM Event o m (Maybe a)]

List of parsers that will be tried in order.

-> ConduitM Event o m (Maybe a)

Result of the first parser to succeed, or Nothing if no parser succeeded

Get the value of the first parser which returns Just. If no parsers succeed (i.e., return Just), this function returns Nothing.

many :: Monad m => Consumer Event m (Maybe a) -> Consumer Event m [a] Source #

Keep parsing elements as long as the parser returns Just.

manyIgnore :: Monad m => Consumer Event m (Maybe a) -> Consumer Event m (Maybe ()) -> Consumer Event m [a] Source #

Keep parsing elements as long as the parser returns Just or the ignore parser returns Just.

many' :: MonadThrow m => Consumer Event m (Maybe a) -> Consumer Event m [a] Source #

Like many, but any tags and content the consumer doesn't match on are silently ignored.

force Source #

Arguments

:: MonadThrow m 
=> String

Error message

-> m (Maybe a)

Optional parser to be forced

-> m a 

Force an optional parser into a required parser. All of the tag functions, attr, choose and many deal with Maybe parsers. Use this when you want to finally force something to happen.

Streaming combinators

manyYield :: Monad m => ConduitM a b m (Maybe b) -> Conduit a m b Source #

Like many, but uses yield so the result list can be streamed to downstream conduits without waiting for manyYield to finish

manyIgnoreYield Source #

Arguments

:: MonadThrow m 
=> ConduitM Event b m (Maybe b)

Consuming parser that generates the result stream

-> Consumer Event m (Maybe ())

Ignore parser that consumes elements to be ignored

-> Conduit Event m b 

Like manyIgnore, but uses yield so the result list can be streamed to downstream conduits without waiting for manyYield to finish

manyYield' :: MonadThrow m => ConduitM Event b m (Maybe b) -> Conduit Event m b Source #

Like many', but uses yield so the result list can be streamed to downstream conduits without waiting for manyYield to finish

takeAllTreesContent :: MonadThrow m => Conduit Event m Event Source #

Like ignoreAllTreesContent, but stream the corresponding Events rather than ignoring them. Incomplete elements (without a closing-tag) will trigger an XmlException.

>>> runResourceT $ parseLBS def "text<a></a>" $$ takeAllTreesContent =$= consume
Just [ EventContent (ContentText "text"), EventBeginElement "a" [], EventEndElement "a"]
>>> runResourceT $ parseLBS def "</a><b></b>" $$ takeAllTreesContent =$= consume
Just [ ]
>>> runResourceT $ parseLBS def "<b><c></c></b></a>text" $$ takeAllTreesContent =$= consume
Just [ EventBeginElement "b" [], EventBeginElement "c" [], EventEndElement "c", EventEndElement "b" ]

Since 1.4.0

Exceptions

Other types