tagsoup-0.1: Parsing and extracting information from (possibly malformed) HTML documents

Stability: moving towards stable




This module is for extracting information out of unstructured HTML code, sometimes known as tag-soup. It targets situations where the author of the HTML is not cooperating with the person trying to extract the information, but is also not trying to hide the information.

The standard practice is to parse a String into a list of Tag values using parseTags, then operate upon that list to extract the necessary information.


Data structures and parsing

data Tag

An HTML element. A document is represented as [Tag]. There is no requirement for TagOpen and TagClose to match.


TagOpen String [Attribute]

An open tag with Attributes in their original order.

TagClose String

A closing tag.

TagText String

A text node, guaranteed not to be the empty string.


type Attribute = (String, String)

An HTML attribute. For example, id="name" generates ("id","name").
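As a rough illustration of the representation described above, the sketch below re-declares the types locally (so it does not depend on the package being installed) and builds a small document by hand. The deriving clause, the names exampleDoc and openAttributes, and the sample fragment are assumptions made for this example; they are not part of the documented API, and the list is not captured parser output.

```haskell
-- Local stand-ins mirroring the declarations documented above.
-- The deriving clause is an assumption added for printing and comparison.
type Attribute = (String, String)

data Tag
    = TagOpen String [Attribute]  -- opening tag, attributes in original order
    | TagClose String             -- closing tag
    | TagText String              -- text node, never the empty string
    deriving (Show, Eq)

-- Hand-built value for a fragment like: <a href="/home" id="top">Home</a><br>
-- (illustrative only)
exampleDoc :: [Tag]
exampleDoc =
    [ TagOpen "a" [("href", "/home"), ("id", "top")]
    , TagText "Home"
    , TagClose "a"
    , TagOpen "br" []   -- never closed, which the representation permits
    ]

-- Hypothetical helper: the attribute list of every opening tag, in order.
openAttributes :: [Tag] -> [[Attribute]]
openAttributes tags = [attrs | TagOpen _ attrs <- tags]
```

Because a document is a flat list rather than a tree, ordinary list functions such as filter, map, and list comprehensions are enough to traverse it.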

parseTags :: String -> [Tag]

Parse an HTML document into a list of Tag. Escape characters are expanded automatically.

Tag Combinators

(~==) :: Tag -> Tag -> Bool

Performs an inexact match. The first item is the tag being tested; the second is the pattern to match against. If a string in the second item is blank, it is considered to match anything. For example:

 (TagText "test" ~== TagText ""    ) == True
 (TagText "test" ~== TagText "test") == True
 (TagText "test" ~== TagText "soup") == False

For TagOpen, the item on the right may omit attributes that are present on the left.
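The matching rules above can be sketched as an ordinary function. This is one possible reading of the description, not the library's implementation: the name inexactMatch and the local type declarations are assumptions, and the attribute rule is simplified to "every attribute on the right must appear unchanged on the left".

```haskell
-- Local stand-ins for the documented types (assumed to derive Eq).
type Attribute = (String, String)

data Tag
    = TagOpen String [Attribute]
    | TagClose String
    | TagText String
    deriving (Show, Eq)

-- Hypothetical sketch of the inexact match: first argument is the tag
-- being tested, second argument is the pattern.
inexactMatch :: Tag -> Tag -> Bool
inexactMatch (TagText y) (TagText x) = null x || x == y
inexactMatch (TagClose y) (TagClose x) = null x || x == y
inexactMatch (TagOpen y ys) (TagOpen x xs) =
    (null x || x == y)        -- blank name in the pattern matches any name
        && all (`elem` ys) xs -- pattern attributes must all be present
inexactMatch _ _ = False      -- different constructors never match
```

Under these assumptions, a tag such as TagOpen "a" [("href","/home"),("id","top")] matches the pattern TagOpen "a" [("id","top")], but not the other way round, because the pattern would then demand an href attribute the tested tag lacks.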

(~/=) :: Tag -> Tag -> Bool

Negation of ~==.

isTagOpen :: Tag -> Bool

Test if a Tag is a TagOpen.

isTagClose :: Tag -> Bool

Test if a Tag is a TagClose.

isTagText :: Tag -> Bool

Test if a Tag is a TagText.

fromTagText :: Tag -> String

Extract the string from within a TagText. Crashes if the Tag is not a TagText.
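Since fromTagText is partial, a common pattern is to filter with isTagText first. The sketch below uses local stand-ins for the documented types and functions; allText and maybeTagText are hypothetical helpers introduced for this example, and the error message is an assumption.

```haskell
-- Local stand-ins for the documented types and functions.
type Attribute = (String, String)

data Tag
    = TagOpen String [Attribute]
    | TagClose String
    | TagText String
    deriving (Show, Eq)

isTagText :: Tag -> Bool
isTagText (TagText _) = True
isTagText _ = False

-- Partial, as documented: fails on anything that is not a TagText.
fromTagText :: Tag -> String
fromTagText (TagText s) = s
fromTagText _ = error "fromTagText: not a TagText"

-- Hypothetical helper: concatenate all text nodes of a document.
-- Filtering first ensures fromTagText only ever sees TagText values.
allText :: [Tag] -> String
allText = concatMap fromTagText . filter isTagText

-- Hypothetical total alternative that returns Nothing instead of crashing.
maybeTagText :: Tag -> Maybe String
maybeTagText (TagText s) = Just s
maybeTagText _ = Nothing
```

The Maybe-returning variant is the safer choice when the caller cannot guarantee which constructor it holds.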

isTagOpenName :: String -> Tag -> Bool

Returns True if the Tag is a TagOpen and matches the given name.

isTagCloseName :: String -> Tag -> Bool

Returns True if the Tag is a TagClose and matches the given name.
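Because open and close tags are not required to match, the two name predicates can be combined to check for imbalance. The following is a sketch with local stand-ins; comparing names by exact string equality is an assumption about how "matches the given name" behaves, and unclosedCount is a hypothetical helper.

```haskell
-- Local stand-ins for the documented types and predicates.
type Attribute = (String, String)

data Tag
    = TagOpen String [Attribute]
    | TagClose String
    | TagText String
    deriving (Show, Eq)

-- Assumption: names are compared with plain (case-sensitive) equality.
isTagOpenName :: String -> Tag -> Bool
isTagOpenName name (TagOpen n _) = n == name
isTagOpenName _ _ = False

isTagCloseName :: String -> Tag -> Bool
isTagCloseName name (TagClose n) = n == name
isTagCloseName _ _ = False

-- Hypothetical helper: opening tags minus closing tags for one name.
-- A positive result means some tags with that name were never closed.
unclosedCount :: String -> [Tag] -> Int
unclosedCount name tags =
    length (filter (isTagOpenName name) tags)
        - length (filter (isTagCloseName name) tags)
```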

sections :: (a -> Bool) -> [a] -> [[a]]

This function takes a list and returns every suffix of that list whose first item satisfies the predicate.
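One way to read this description is "keep each tail of the list whose head satisfies the predicate". The sketch below follows that reading; sectionsSketch is a hypothetical name, and the definition is an illustration rather than the library's source.

```haskell
import Data.List (tails)

-- Hypothetical sketch: every non-empty tail whose first element
-- satisfies the predicate. The pattern s@(y:_) skips the empty tail.
sectionsSketch :: (a -> Bool) -> [a] -> [[a]]
sectionsSketch p xs = [s | s@(y : _) <- tails xs, p y]
```

For instance, sectionsSketch even [1,2,3,4,5] evaluates to [[2,3,4,5],[4,5]]. Note that the results overlap: 4 and 5 appear in both sections.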

partitions :: (a -> Bool) -> [a] -> [[a]]

This function is similar to sections, but splits the list so that no element appears in more than one partition.
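A non-overlapping split of this kind can be sketched by starting a new chunk at each matching element and ending it just before the next match. The name partitionsSketch is hypothetical, and dropping any elements that precede the first match is an assumption of this sketch, not something the description above states.

```haskell
-- Hypothetical sketch: each chunk starts at an element satisfying the
-- predicate and runs up to (but not including) the next such element.
-- Assumption: elements before the first match are discarded.
partitionsSketch :: (a -> Bool) -> [a] -> [[a]]
partitionsSketch p = go . dropWhile (not . p)
  where
    go [] = []
    go (x : xs) = (x : chunk) : go rest
      where
        (chunk, rest) = break p xs
```

For instance, partitionsSketch even [1,2,3,4,5] evaluates to [[2,3],[4,5]], so each element lands in at most one chunk.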