Portability	portable
Stability	unstable
Maintainer	http://www.cs.york.ac.uk/~ndm/

Text.HTML.TagSoup

Contents

Data structures and parsing
Tag identification
Extraction
Utility
Combinators

Description

This module extracts information from unstructured HTML, sometimes known as tag soup. It is intended for situations where the author of the HTML is not cooperating with the person trying to extract the information, but is also not trying to hide it.

The standard practice is to parse a String into a list of Tag values using parseTags, then operate on that list to extract the necessary information.
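The parse-then-extract workflow described above can be sketched as follows. This is a minimal illustration, assuming the tagsoup package is installed; the helper name `textNodes` and the sample input are hypothetical.

```haskell
import Text.HTML.TagSoup

-- Hypothetical helper: parse the input, then keep the contents of
-- every text node, in document order.
textNodes :: String -> [String]
textNodes html = [s | TagText s <- parseTags html]

main :: IO ()
main = mapM_ putStrLn (textNodes "<p>Hello <b>tag</b> soup")
```

Note that the sample input never closes its `<p>` element; the parser does not require it to.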

Data structures and parsing

data Tag

A single HTML element; a document is represented as [Tag]. There is no requirement for TagOpen and TagClose to match.

Constructors

TagOpen String [Attribute]	An open tag with `Attribute`s in their original order.
TagClose String	A closing tag
TagText String	A text node, guaranteed not to be the empty string
TagComment String	A comment
TagWarning String	Meta: Mark a syntax error in the input file
TagPosition !Row !Column	Meta: The position of a parsed element

Instances

Eq Tag
Ord Tag
Show Tag
TagRep Tag

type Attribute = (String, String)

An HTML attribute such as id="name" is represented as ("id","name").
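To make the attribute representation concrete, the sketch below pulls the attribute list out of the first open tag in a fragment. It assumes the tagsoup package is installed; `firstAttributes` is a hypothetical helper, not part of the module.

```haskell
import Text.HTML.TagSoup

-- Hypothetical helper: the attributes of the first open tag, or an
-- empty list if the input contains no open tag.
firstAttributes :: String -> [(String, String)]
firstAttributes html =
  case [attrs | TagOpen _ attrs <- parseTags html] of
    (attrs : _) -> attrs
    []          -> []

main :: IO ()
main = print (firstAttributes "<p id=\"name\" class=\"intro\">")
```

The pairs come back in the order in which they were written in the tag.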

data Options

Constructors

Options
Fields

optTagPosition :: Bool	Should `TagPosition` values be given before every item
optTagWarning :: Bool	Should `TagWarning` values be given
optLookupEntity :: String -> [Tag]	How to look up an entity
optMaxEntityLength :: Maybe Int	The maximum length of an entity's content (Nothing for no maximum, defaults to 10)

options :: Options

parseTags :: String -> [Tag]

parseTagsOptions :: Options -> String -> [Tag]
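One plausible use of parseTagsOptions is to switch on warnings and collect them. The sketch below assumes the version of tagsoup documented here, and further assumes that `options` is a suitable baseline Options value to modify with record-update syntax (the documentation above does not say so explicitly). The helper name `warnings` and the sample input are hypothetical.

```haskell
import Text.HTML.TagSoup

-- Hypothetical helper: parse with warnings enabled and keep only the
-- warning messages.
-- Assumption: `options` is a reasonable baseline value to update.
warnings :: String -> [String]
warnings html =
  [w | TagWarning w <- parseTagsOptions options { optTagWarning = True } html]

main :: IO ()
main = mapM_ putStrLn (warnings "<p>unfinished &unknownentity; <b")
```

The exact wording and number of warnings depend on the parser, so no output is shown here.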

canonicalizeTags :: [Tag] -> [Tag]

Turns all tag names to lower case and converts DOCTYPE to upper case.
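As an illustration, canonicalizing before matching lets later code compare against lower-case names only. This is a sketch assuming the tagsoup package is installed; `openTagNames` is a hypothetical helper.

```haskell
import Text.HTML.TagSoup

-- Hypothetical helper: names of all open tags after canonicalization,
-- so mixed-case input yields lower-case names.
openTagNames :: String -> [String]
openTagNames html =
  [name | TagOpen name _ <- canonicalizeTags (parseTags html)]

main :: IO ()
main = print (openTagNames "<DIV><Span>text</Span></DIV>")
```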

Tag identification

isTagOpen :: Tag -> Bool

Test if a Tag is a TagOpen

isTagClose :: Tag -> Bool

Test if a Tag is a TagClose

isTagText :: Tag -> Bool

Test if a Tag is a TagText

isTagWarning :: Tag -> Bool

Test if a Tag is a TagWarning

isTagOpenName :: String -> Tag -> Bool

Returns True if the Tag is TagOpen and matches the given name

isTagCloseName :: String -> Tag -> Bool

Returns True if the Tag is TagClose and matches the given name
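The identification predicates are mostly useful as arguments to filter and similar list functions. A minimal sketch, assuming the tagsoup package is installed; `countLinks` and `countClosers` are hypothetical helpers.

```haskell
import Text.HTML.TagSoup

-- Hypothetical helper: how many <a ...> open tags does the input contain?
countLinks :: String -> Int
countLinks = length . filter (isTagOpenName "a") . parseTags

-- Hypothetical helper: how many closing tags of any name?
countClosers :: String -> Int
countClosers = length . filter isTagClose . parseTags

main :: IO ()
main = do
  let html = "<a href='x'>one</a> <a>two</a> <b>three</b>"
  print (countLinks html)
  print (countClosers html)
```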

Extraction

fromTagText :: Tag -> String

Extract the string from within a TagText. Crashes if the Tag is not a TagText.

fromAttrib :: String -> Tag -> String

Extract the value of the named attribute. Crashes if the Tag is not a TagOpen. Returns "" if the attribute is not present.
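Because fromAttrib crashes on anything other than a TagOpen, it is safest to filter first. The sketch below assumes the tagsoup package is installed; `linkTargets` is a hypothetical helper.

```haskell
import Text.HTML.TagSoup

-- Hypothetical helper: the href of every <a> open tag. The
-- isTagOpenName guard ensures fromAttrib only ever sees a TagOpen.
linkTargets :: String -> [String]
linkTargets html =
  [ fromAttrib "href" tag
  | tag <- parseTags html
  , isTagOpenName "a" tag
  ]

main :: IO ()
main = print (linkTargets "<a href=\"/home\">Home</a><a name=\"top\">Top</a>")
```

The second anchor has no href, so its entry in the result is the empty string rather than an error.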

maybeTagText :: Tag -> Maybe String

Extract the string from within TagText, otherwise Nothing

maybeTagWarning :: Tag -> Maybe String

Extract the string from within TagWarning, otherwise Nothing
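The Maybe-returning extractors are the total counterparts of fromTagText: they can be mapped over any tag list without a prior filter. A minimal sketch for maybeTagText, assuming the tagsoup package is installed (maybeTagWarning can be used the same way); `textsOrNothing` is a hypothetical helper.

```haskell
import Text.HTML.TagSoup

-- Hypothetical helper: one entry per parsed tag, holding the text for
-- text nodes and Nothing for every other kind of tag.
textsOrNothing :: String -> [Maybe String]
textsOrNothing = map maybeTagText . parseTags

main :: IO ()
main = print (textsOrNothing "<p>hi</p>")
```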

innerText :: [Tag] -> String

Extract all text content from tags (similar to Verbatim found in HaXml)
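For example, innerText can be used to strip the markup from a fragment. This is a sketch assuming the tagsoup package is installed; `visibleText` is a hypothetical helper.

```haskell
import Text.HTML.TagSoup

-- Hypothetical helper: concatenate the contents of all text nodes,
-- discarding the tags themselves.
visibleText :: String -> String
visibleText = innerText . parseTags

main :: IO ()
main = putStrLn (visibleText "<p>Hello <b>tag</b> soup</p>")
```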

Utility

sections :: (a -> Bool) -> [a] -> [[a]]

This function takes a list, and returns all suffixes whose first item matches the predicate.
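Since sections works on any list, its behaviour can be seen on a plain String before applying it to tags. The sketch below assumes the tagsoup package is installed; `suffixesFromA` and `headings` are hypothetical helpers.

```haskell
import Text.HTML.TagSoup

-- Hypothetical helper: every suffix of the input that starts with 'a'.
suffixesFromA :: String -> [String]
suffixesFromA = sections (== 'a')

-- Hypothetical helper: the text of each <h2> heading. Each section
-- starts at an <h2> open tag and runs to the end of the document, so
-- it is cut off at the first </h2>.
headings :: String -> [String]
headings html =
  [ innerText (takeWhile (not . isTagCloseName "h2") section)
  | section <- sections (isTagOpenName "h2") (parseTags html)
  ]

main :: IO ()
main = do
  print (suffixesFromA "abac")
  print (headings "<h2>One</h2><p>a</p><h2>Two</h2>")
```

Because every section extends to the end of the input, later sections are suffixes of earlier ones; trimming with takeWhile keeps each result local to its heading.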

partitions :: (a -> Bool) -> [a] -> [[a]]

This function is similar to sections, but splits the list so no element appears in any two partitions.
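A typical use of partitions is splitting a table into rows, where each tag should belong to at most one row. A minimal sketch, assuming the tagsoup package is installed; `rowTexts` is a hypothetical helper.

```haskell
import Text.HTML.TagSoup

-- Hypothetical helper: the concatenated text of each table row. Each
-- partition starts at a <tr> open tag and stops before the next one.
rowTexts :: String -> [String]
rowTexts html =
  map innerText (partitions (isTagOpenName "tr") (parseTags html))

main :: IO ()
main = print (rowTexts
  "<table><tr><td>a</td><td>b</td></tr><tr><td>c</td></tr></table>")
```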

Combinators

class TagRep a

A class that allows Strings or Tags to be used as match patterns.

Instances

TagRep Tag
IsChar c => TagRep [c]

class IsChar a

Instances

IsChar Char

(~==) :: TagRep t => Tag -> t -> Bool

Performs an inexact match. The first item should be the thing to match. If the second item is a blank string, that is considered to match anything. For example:

 (TagText "test" ~== TagText ""    ) == True
 (TagText "test" ~== TagText "test") == True
 (TagText "test" ~== TagText "soup") == False

For TagOpen, the item on the right may omit attributes that are present on the left.
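The TagRep instance for strings means a pattern can be written as a literal piece of HTML. The sketch below assumes the tagsoup package is installed and that the pattern string parses to exactly one tag; `matchesFirst` is a hypothetical helper.

```haskell
import Text.HTML.TagSoup

-- Hypothetical helper: does the first open tag in the input match the
-- given pattern? The pattern is a String such as "<a>", relying on the
-- TagRep instance for strings.
matchesFirst :: String -> String -> Bool
matchesFirst pat html =
  case filter isTagOpen (parseTags html) of
    (tag : _) -> tag ~== pat
    []        -> False

main :: IO ()
main = do
  let html = "<a href=\"/home\" class=\"nav\">Home</a>"
  print (matchesFirst "<a>" html)                 -- no attributes required
  print (matchesFirst "<a href=\"/home\">" html)  -- class may be omitted
  print (matchesFirst "<a id=\"x\">" html)        -- id is absent on the left
```

Attributes named in the pattern must be present on the tag being tested, while extra attributes on that tag are ignored.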

(~/=) :: TagRep t => Tag -> t -> Bool

Negation of ~==