tagsoup-megaparsec: A Tag token parser and Tag specific parsing combinators

[ bsd3, library, xml ] [ Propose Tags ]

Please see README.md

[Skip to Readme]




Manual Flags


Enable -Werror


Use -f <flag> to enable a flag, or -f -<flag> to disable that flag. More info


Maintainer's Corner

Package maintainers

For package maintainers and hackage trustees


  • No Candidates
Versions [RSS],
Change log CHANGELOG.md
Dependencies base (>=4.7 && <5), containers (>=0.5 && <0.6), megaparsec (>=5.0.1 && <6), semigroups (>=0.18 && <0.19), tagsoup (>=0.13 && <0.15) [details]
License BSD-3-Clause
Copyright BSD3
Author Kwang Yul Seo
Maintainer kwangyul.seo@gmail.com
Category XML
Home page https://github.com/kseo/tagsoup-megaparsec#readme
Source repo head: git clone https://github.com/kseo/tagsoup-megaparsec
Uploaded by kseo at 2016-08-03T03:36:10Z
Reverse Dependencies 1 direct, 0 indirect [details]
Downloads 1292 total (9 in the last 30 days)
Rating 2.0 (votes: 1) [estimated by Bayesian average]
Your Rating
  • λ
  • λ
  • λ
Status Docs available [build log]
Last success reported on 2016-08-03 [all 1 reports]

Readme for tagsoup-megaparsec-

[back to package description]


Build Status

A Tag token parser and Tag specific parsing combinators, inspired by parsec-tagsoup and tagsoup-parsec. This library helps you build a megaparsec parser using TagSoup's Tag as tokens.


DOM parser

We can build a DOM parser using TagSoup's Tag as a token type in Megaparsec. Let's start the example with importing all the required modules.

import Data.Text ( Text )
import qualified Data.Text as T
import Data.HashMap.Strict ( HashMap )
import qualified Data.HashMap.Strict as HMS
import Text.HTML.TagSoup
import Text.Megaparsec
import Text.Megaparsec.ShowToken
import Text.Megaparsec.TagSoup

Here's the data types used to represent our DOM. Node is either ElementNode or TextNode. TextNode data constructor takes a Text and ElementNode data constructor takes an Element whose fields consist of elementName, elementAttrs and elementChildren.

type AttrName   = Text
type AttrValue  = Text

data Element = Element
  { elementName :: !Text
  , elementAttrs :: !(HashMap AttrName AttrValue)
  , elementChildren :: [Node]
  } deriving (Eq, Show)

data Node =
    ElementNode Element
  | TextNode Text
  deriving (Eq, Show)

Our Parser is defined as a type synonym for TagParser Text. TagParser takes a type argument representing the string type and we chose Text here. We can pass any of StringLike types such as String and ByteString.

type Parser = TagParser Text

There is nothing new in defining a parser except that our token is Tag Text instead of Char. We can use any Megaparsec combinators we want as usual. Our node parser is either element or text so we used the choice combinator (<|>).

node :: Parser Node
node = ElementNode <$> element
   <|> TextNode <$> text

tagsoup-megaparsec library provides some Tag specific combinators.

  • tagText: parse a chunk of text.
  • anyTagOpen/anyTagClose: parse any opening and closing tag.

text and element parsers are built using these combinators.

NOTE: We don't need to worry about the text blocks containing only whitespace characters because all the parsers provided by tagsoup-megaparsec are lexeme parsers.

text :: Parser Text
text = fromTagText <$> tagText

element :: Parser Element
element = do
  t@(TagOpen tagName attrs) <- anyTagOpen
  children <- many node
  closeTag@(TagClose tagName') <- anyTagClose
  if tagName == tagName'
     then return $ Element tagName (HMS.fromList attrs) children
     else fail $ "unexpected close tag" ++ showToken closeTag

Now it's time to define our driver. parseDOM takes a Text and returns either ParseError or [Node]. We used many combinator to represent that there are zero or more occurences of node. We used TagSoup's parseTags to create tokens and passed it to Megaparsec's parse function.

parseDOM :: Text -> Either ParseError [Node]
parseDOM html = parse (many node) "" tags
  where tags = parseTags html