# tagsoup-megaparsec

[![Build Status](https://travis-ci.org/kseo/tagsoup-megaparsec.svg?branch=master)](https://travis-ci.org/kseo/tagsoup-megaparsec)

A Tag token parser and Tag specific parsing combinators, inspired by
[parsec-tagsoup][parsec-tagsoup] and [tagsoup-parsec][tagsoup-parsec].

This library helps you build a megaparsec parser using TagSoup's Tag as tokens.

[parsec-tagsoup]: https://hackage.haskell.org/package/parsec-tagsoup
[tagsoup-parsec]: https://hackage.haskell.org/package/tagsoup-parsec

## Usage

### DOM parser

We can build a DOM parser using TagSoup's Tag as a token type in Megaparsec.
Let's start the example by importing all the required modules.

```haskell
import           Data.Text ( Text )
import qualified Data.Text as T
import           Data.HashMap.Strict ( HashMap )
import qualified Data.HashMap.Strict as HMS
import           Text.HTML.TagSoup
import           Text.Megaparsec
import           Text.Megaparsec.ShowToken
import           Text.Megaparsec.TagSoup
```

Here are the data types used to represent our DOM. A `Node` is either an
`ElementNode` or a `TextNode`. The `TextNode` data constructor takes a `Text`,
and the `ElementNode` data constructor takes an `Element` whose fields are
`elementName`, `elementAttrs` and `elementChildren`.

```haskell
type AttrName = Text
type AttrValue = Text

data Element = Element
  { elementName :: !Text
  , elementAttrs :: !(HashMap AttrName AttrValue)
  , elementChildren :: [Node]
  } deriving (Eq, Show)

data Node =
    ElementNode Element
  | TextNode Text
  deriving (Eq, Show)
```

Our `Parser` is defined as a type synonym for `TagParser Text`. `TagParser`
takes a type argument representing the string type, and we chose `Text` here.
Any other `StringLike` type, such as `String` or `ByteString`, can be used
instead.

```haskell
type Parser = TagParser Text
```

There is nothing new in defining a parser except that our token is `Tag Text`
instead of `Char`. We can use any Megaparsec combinators we want as usual.

Our `node` parser is either `element` or `text`, so we use the choice
combinator `(<|>)`.

```haskell
node :: Parser Node
node = ElementNode <$> element
   <|> TextNode <$> text
```

The tagsoup-megaparsec library provides some `Tag` specific combinators.

* `tagText`: parse a chunk of text.
* `anyTagOpen`/`anyTagClose`: parse any opening or closing tag.

The `text` and `element` parsers are built using these combinators.

NOTE: We don't need to worry about text blocks containing only whitespace
characters because all the parsers provided by tagsoup-megaparsec are lexeme
parsers.

```haskell
text :: Parser Text
text = fromTagText <$> tagText

element :: Parser Element
element = do
  TagOpen tagName attrs <- anyTagOpen
  children <- many node
  closeTag@(TagClose tagName') <- anyTagClose
  if tagName == tagName'
     then return $ Element tagName (HMS.fromList attrs) children
     else fail $ "unexpected close tag " ++ showToken closeTag
```

Now it's time to define our driver. `parseDOM` takes a `Text` and returns
either a `ParseError` or `[Node]`. We use the `many` combinator to express that
there are zero or more occurrences of `node`. We use TagSoup's `parseTags` to
create the tokens and pass them to Megaparsec's `parse` function.

```haskell
parseDOM :: Text -> Either ParseError [Node]
parseDOM html = parse (many node) "" tags
  where tags = parseTags html
```
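
To get a feel for what the parsers above consume, it can help to look at the
token stream that `parseTags` produces before it is handed to `parse`. The
following is a minimal sketch that depends only on TagSoup. The name
`exampleTags` and the sample input are hypothetical, and `String` is used
instead of `Text` only to avoid extra imports.

```haskell
import Text.HTML.TagSoup (Tag (..), parseTags)

-- Hypothetical sample input. Each element of the resulting list is one
-- token of the kind a TagParser consumes.
exampleTags :: [Tag String]
exampleTags = parseTags "<p class=\"x\">hi</p>"

-- Print one token per line.
main :: IO ()
main = mapM_ print exampleTags
```

Each element is a `Tag String` value built with a constructor such as
`TagOpen`, `TagText` or `TagClose`, which are the constructors that the
`element` and `text` parsers pattern match on.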