This module is for extracting information out of unstructured HTML code, sometimes known as tag-soup. This is for situations where the author of the HTML is not cooperating with the person trying to extract the information, but is also not trying to hide the information.
- data Tag
- type Attribute = (String, String)
- module Text.HTML.TagSoup.Parser
- canonicalizeTags :: [Tag] -> [Tag]
- isTagOpen :: Tag -> Bool
- isTagClose :: Tag -> Bool
- isTagText :: Tag -> Bool
- isTagWarning :: Tag -> Bool
- isTagOpenName :: String -> Tag -> Bool
- isTagCloseName :: String -> Tag -> Bool
- fromTagText :: Tag -> String
- fromAttrib :: String -> Tag -> String
- maybeTagText :: Tag -> Maybe String
- maybeTagWarning :: Tag -> Maybe String
- innerText :: [Tag] -> String
- sections :: (a -> Bool) -> [a] -> [[a]]
- partitions :: (a -> Bool) -> [a] -> [[a]]
- class TagRep a
- class IsChar a
- (~==) :: TagRep t => Tag -> t -> Bool
- (~/=) :: TagRep t => Tag -> t -> Bool
Data structures and parsing
data Tag
|TagOpen String [Attribute]|
An open tag with its list of attributes.
|TagClose String|
A closing tag.
|TagText String|
A text node, guaranteed not to be the empty string.
|TagWarning String|
Meta: marks a syntax error in the input file.
|TagPosition !Row !Column|
Meta: the position of a parsed element.
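To make the constructors concrete, the sketch below declares a local copy of the type and writes out by hand the tag list one would expect for a small fragment. These definitions are stand-ins written for this example, not the library's source: `Row` and `Column` are assumed to be `Int`, `exampleTags` is a hypothetical name, and the two predicates are simplified versions of those listed in the synopsis.

```haskell
-- Local stand-ins for the documented types (not imported from the library).
type Attribute = (String, String)
type Row = Int      -- assumption: a plain Int
type Column = Int   -- assumption: a plain Int

data Tag
  = TagOpen String [Attribute]
  | TagClose String
  | TagText String
  | TagWarning String
  | TagPosition !Row !Column
  deriving (Show, Eq)

-- Hand-written tag list for the fragment: <a href="index.html">Home</a>
exampleTags :: [Tag]
exampleTags =
  [ TagOpen "a" [("href", "index.html")]
  , TagText "Home"
  , TagClose "a"
  ]

-- Simplified versions of two predicates from the synopsis.
isTagOpen :: Tag -> Bool
isTagOpen (TagOpen _ _) = True
isTagOpen _             = False

isTagText :: Tag -> Bool
isTagText (TagText _) = True
isTagText _           = False
```

Filtering `exampleTags` with `isTagText` leaves only the `TagText "Home"` element.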
canonicalizeTags :: [Tag] -> [Tag]
Converts all tag names to lower case and converts DOCTYPE to upper case.
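The canonicalization described above can be sketched as a single `map` over the tag list. This is a minimal illustration under stated assumptions, not the library's implementation: it uses a cut-down local `Tag` type, the name `canonicalizeSketch` is hypothetical, and it assumes a DOCTYPE declaration is represented as a `TagOpen` whose name is `!DOCTYPE` in some letter case.

```haskell
import Data.Char (toLower, toUpper)

type Attribute = (String, String)

-- Cut-down local copy of the Tag type, enough for this example.
data Tag
  = TagOpen String [Attribute]
  | TagClose String
  | TagText String
  deriving (Show, Eq)

-- Lower-case every tag name, except DOCTYPE, which is upper-cased.
-- Attributes and text nodes are left untouched.
canonicalizeSketch :: [Tag] -> [Tag]
canonicalizeSketch = map step
  where
    step (TagOpen name attrs)
      | isDoctype name = TagOpen (map toUpper name) attrs
      | otherwise      = TagOpen (map toLower name) attrs
    step (TagClose name) = TagClose (map toLower name)
    step other           = other

    -- Assumption: DOCTYPE appears as an open tag named "!DOCTYPE" (any case).
    isDoctype name = map toUpper name == "!DOCTYPE"
```

Canonicalizing first means later comparisons on tag names do not have to account for `<DIV>` versus `<div>`.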
fromAttrib :: String -> Tag -> String
Extracts an attribute value; crashes if the tag is not a TagOpen. Returns "" if the attribute is not present.
innerText :: [Tag] -> String
Extracts all text content from a list of tags (similar to Verbatim in HaXml).
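The behavior of attribute extraction and text extraction described above can be illustrated with two short functions. This is a sketch on a cut-down local `Tag` type, not the library's source; the names `fromAttribSketch` and `innerTextSketch` are hypothetical.

```haskell
import Data.Maybe (fromMaybe)

type Attribute = (String, String)

-- Cut-down local copy of the Tag type, enough for this example.
data Tag
  = TagOpen String [Attribute]
  | TagClose String
  | TagText String
  deriving (Show, Eq)

-- Look up an attribute on an open tag; "" when the attribute is absent.
-- Calling this on anything other than a TagOpen is an error, mirroring
-- the "crashes" behavior described in the text.
fromAttribSketch :: String -> Tag -> String
fromAttribSketch name (TagOpen _ attrs) = fromMaybe "" (lookup name attrs)
fromAttribSketch _ tag =
  error ("fromAttribSketch: not a TagOpen: " ++ show tag)

-- Concatenate the contents of every text node, ignoring all other tags.
innerTextSketch :: [Tag] -> String
innerTextSketch tags = concat [s | TagText s <- tags]
```

Because the first function is partial, a caller would typically guard it with a check such as `isTagOpen` before use.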
sections :: (a -> Bool) -> [a] -> [[a]]
Takes a list and returns all suffixes whose first item matches the predicate.
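The "all suffixes whose first item matches" behavior can be written directly with `Data.List.tails`. This is a sketch that follows the description above rather than the library's actual definition; `sectionsSketch` is a hypothetical name.

```haskell
import Data.List (tails)

-- Keep every non-empty suffix whose head satisfies the predicate.
-- The empty suffix produced by 'tails' fails the pattern match and is skipped.
sectionsSketch :: (a -> Bool) -> [a] -> [[a]]
sectionsSketch p xs = [suffix | suffix@(x : _) <- tails xs, p x]
```

For instance, `sectionsSketch even [1, 2, 3, 4, 5]` yields `[[2,3,4,5],[4,5]]`. Note that the suffixes overlap: the element `4` appears in both results.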
partitions :: (a -> Bool) -> [a] -> [[a]]
Similar to sections, but splits the list so that no element appears in more than one partition.
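One way to obtain non-overlapping pieces is to start a new partition at each matching element and let it run until the next match. The sketch below takes that approach; it additionally assumes that elements before the first match are dropped, which the description above does not state. `partitionsSketch` is a hypothetical name, not the library's definition.

```haskell
-- Split a list into chunks, each beginning with an element that satisfies
-- the predicate and extending up to (but not including) the next such element.
-- Assumption: anything before the first matching element is discarded.
partitionsSketch :: (a -> Bool) -> [a] -> [[a]]
partitionsSketch p xs =
  case dropWhile (not . p) xs of
    []         -> []
    (y : rest) ->
      let (body, remaining) = break p rest
      in (y : body) : partitionsSketch p remaining
```

For instance, `partitionsSketch even [1, 2, 3, 4, 5]` yields `[[2,3],[4,5]]`, where each element appears in at most one chunk.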
class TagRep a
A class that allows Strings or Tags to be used as match patterns.
(~==) :: TagRep t => Tag -> t -> Bool
Performs an inexact match. The first item is the tag being tested and the second is the pattern to match against. If the second item contains a blank string, that is considered to match anything. For example:
(TagText "test" ~== TagText "" ) == True
(TagText "test" ~== TagText "test") == True
(TagText "test" ~== TagText "soup") == False
For TagOpen, attributes missing on the right are allowed.
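The matching rules above can be sketched as an ordinary function on two tags. This is a simplified illustration, not the library's operator: it uses a cut-down local `Tag` type, takes the pattern as a `Tag` instead of going through the `TagRep` class, and treats only a blank name or blank text as a wildcard. `matchesSketch` is a hypothetical name.

```haskell
type Attribute = (String, String)

-- Cut-down local copy of the Tag type, enough for this example.
data Tag
  = TagOpen String [Attribute]
  | TagClose String
  | TagText String
  deriving (Show, Eq)

-- First argument: the tag being tested. Second argument: the pattern.
-- A blank string in the pattern matches anything. For open tags, every
-- attribute listed in the pattern must be present on the tested tag;
-- extra attributes on the tested tag are ignored.
matchesSketch :: Tag -> Tag -> Bool
matchesSketch (TagText actual) (TagText pat) = null pat || pat == actual
matchesSketch (TagClose actual) (TagClose pat) = null pat || pat == actual
matchesSketch (TagOpen actual attrs) (TagOpen pat patAttrs) =
  (null pat || pat == actual) && all (`elem` attrs) patAttrs
matchesSketch _ _ = False
```

Under these rules a pattern such as `TagOpen "a" []` matches any `<a>` tag regardless of its attributes, while a pattern that lists an attribute requires that exact name and value pair to be present.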