Portability | portable |
---|---|
Stability | unstable |
Maintainer | http://www.cs.york.ac.uk/~ndm/ |
This module is for extracting information out of unstructured HTML code, sometimes known as tag-soup. This is for situations where the author of the HTML is not cooperating with the person trying to extract the information, but is also not trying to hide the information.
The standard practice is to parse a String to Tags using parseTags, then operate upon it to extract the necessary information.
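The parse-then-extract workflow described above can be sketched as follows. The input string and the names sampleHtml and textNodes are hypothetical; only parseTags, isTagText and fromTagText come from this module, and the sketch assumes the tagsoup package is installed.

```haskell
import Text.HTML.TagSoup (parseTags, isTagText, fromTagText)

-- Hypothetical input; any HTML string would do.
sampleHtml :: String
sampleHtml = "<p>Hello <b>world</b></p>"

-- Step 1: parse the String into a flat list of tags.
-- Step 2: keep only the text nodes and pull out their contents.
textNodes :: [String]
textNodes = map fromTagText (filter isTagText (parseTags sampleHtml))

main :: IO ()
main = mapM_ putStrLn textNodes
```

Filtering with isTagText before calling fromTagText keeps the extraction away from non-text tags.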
- data Tag
- = TagOpen String [Attribute]
- | TagClose String
- | TagText String
- | TagComment String
- | TagWarning String
- | TagPosition !Row !Column
- type Attribute = (String, String)
- data Options = Options {
- optTagPosition :: Bool
- optTagWarning :: Bool
- optLookupEntity :: String -> [Tag]
- optMaxEntityLength :: Maybe Int
- options :: Options
- parseTags :: String -> [Tag]
- parseTagsOptions :: Options -> String -> [Tag]
- canonicalizeTags :: [Tag] -> [Tag]
- isTagOpen :: Tag -> Bool
- isTagClose :: Tag -> Bool
- isTagText :: Tag -> Bool
- isTagWarning :: Tag -> Bool
- isTagOpenName :: String -> Tag -> Bool
- isTagCloseName :: String -> Tag -> Bool
- fromTagText :: Tag -> String
- fromAttrib :: String -> Tag -> String
- maybeTagText :: Tag -> Maybe String
- maybeTagWarning :: Tag -> Maybe String
- innerText :: [Tag] -> String
- sections :: (a -> Bool) -> [a] -> [[a]]
- partitions :: (a -> Bool) -> [a] -> [[a]]
- class TagRep a
- class IsChar a
- (~==) :: TagRep t => Tag -> t -> Bool
- (~/=) :: TagRep t => Tag -> t -> Bool
Data structures and parsing
TagOpen String [Attribute] | An open tag with its Attributes |
TagClose String | A closing tag |
TagText String | A text node, guaranteed not to be the empty string |
TagComment String | A comment |
TagWarning String | Meta: Mark a syntax error in the input file |
TagPosition !Row !Column | Meta: The position of a parsed element |
Options | |
|
parseTagsOptions :: Options -> String -> [Tag]
canonicalizeTags :: [Tag] -> [Tag]
Turns all tag names to lower case and converts DOCTYPE to upper case.
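As a small illustration of the lower-casing behaviour, the sketch below runs canonicalizeTags over a hand-built tag list. The names mixedCase and canonical are hypothetical, and type signatures on them are omitted so that they are inferred from the constructors.

```haskell
import Text.HTML.TagSoup (Tag(..), canonicalizeTags)

-- Hypothetical tag list with inconsistent capitalisation.
mixedCase = [TagOpen "DIV" [], TagText "Hi", TagClose "Div"]

-- Tag names are lower-cased; the text node is left as it is.
canonical = canonicalizeTags mixedCase

main :: IO ()
main = print canonical
```

Canonicalizing before matching means later comparisons against names such as "div" do not need to consider case.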
Tag identification
isTagWarning :: Tag -> Bool
Test if a Tag is a TagWarning
isTagOpenName :: String -> Tag -> Bool
isTagCloseName :: String -> Tag -> Bool
Extraction
fromAttrib :: String -> Tag -> String
Extract an attribute; crashes if the tag is not a TagOpen. Returns "" if the attribute is not present.
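Because fromAttrib crashes on anything other than a TagOpen, it can help to guard the call. The sketch below shows both the direct use and a guarded wrapper; link and safeAttrib are hypothetical names introduced for illustration and are not part of this module.

```haskell
import Text.HTML.TagSoup (Tag(..), fromAttrib, isTagOpen)

-- Hypothetical sample tag; its type is inferred from the constructor.
link = TagOpen "a" [("href", "http://example.com/")]

-- Present attribute: its value is returned.
href :: String
href = fromAttrib "href" link

-- Absent attribute: the empty string is returned.
title :: String
title = fromAttrib "title" link

-- Hypothetical helper: only call fromAttrib when the tag is a TagOpen,
-- so other kinds of tag give Nothing instead of a crash.
safeAttrib name tag
  | isTagOpen tag = Just (fromAttrib name tag)
  | otherwise     = Nothing

main :: IO ()
main = do
  putStrLn href
  print (null title)
  print (safeAttrib "href" (TagText "plain text"))
```

Note that the wrapper still returns Just "" for a TagOpen that lacks the attribute, mirroring the behaviour of fromAttrib itself.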
maybeTagWarning :: Tag -> Maybe String
Extract the string from within TagWarning, otherwise Nothing
innerText :: [Tag] -> String
Extract all text content from tags (similar to Verbatim found in HaXml)
Utility
sections :: (a -> Bool) -> [a] -> [[a]]
This function takes a list, and returns all suffixes whose first item matches the predicate.
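To make the "suffixes whose first item matches" wording concrete, here is a minimal re-implementation using only base. The name sectionsSketch is hypothetical, and this is a sketch of the documented behaviour rather than the library's own source.

```haskell
import Data.List (tails)

-- Keep every non-empty suffix whose first element satisfies the predicate.
sectionsSketch :: (a -> Bool) -> [a] -> [[a]]
sectionsSketch p xs = [s | s@(x:_) <- tails xs, p x]

main :: IO ()
main = print (sectionsSketch (== 'a') "banana")
-- prints ["anana","ana","a"]
```

Each result runs to the end of the input, so the suffixes overlap. With a list of tags, this gives one suffix per matching tag, each of which can then be trimmed with take or takeWhile.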
partitions :: (a -> Bool) -> [a] -> [[a]]
This function is similar to sections, but splits the list so no element appears in any two partitions.
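The non-overlapping split can be sketched with base alone. The name partitionsSketch is hypothetical, and the sketch assumes that elements before the first match are discarded; it is an illustration of the description above, not the library's own source.

```haskell
-- Start a new partition at every element that satisfies the predicate;
-- each partition runs up to, but not including, the next match.
partitionsSketch :: (a -> Bool) -> [a] -> [[a]]
partitionsSketch p xs =
  case dropWhile (not . p) xs of
    []     -> []
    (y:ys) -> let (rest, more) = break p ys
              in (y : rest) : partitionsSketch p more

main :: IO ()
main = print (partitionsSketch (== 'a') "banana")
-- prints ["an","an","a"]
```

Compared with the suffix-based approach, each element here belongs to at most one group, which suits inputs made of repeated records such as table rows.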
Combinators
Define a class to allow Strings or Tags to be used as matches
(~==) :: TagRep t => Tag -> t -> Bool
Performs an inexact match; the first item should be the thing to match. If the second item is a blank string, that is considered to match anything. For example:
(TagText "test" ~== TagText "" ) == True
(TagText "test" ~== TagText "test") == True
(TagText "test" ~== TagText "soup") == False
For TagOpen missing attributes on the right are allowed.
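The attribute rule for TagOpen can be illustrated with a few comparisons. The tag anchor and the list checks are hypothetical names; the sketch assumes the tagsoup package is installed and uses only the two operators from this section, with a Tag or a String on the right-hand side.

```haskell
import Text.HTML.TagSoup (Tag(..), (~==), (~/=))

-- Hypothetical tag to match against; its type is inferred from the constructor.
anchor = TagOpen "a" [("href", "http://example.com/"), ("class", "ext")]

-- Every entry is expected to be True.
checks :: [Bool]
checks =
  [ anchor ~== TagOpen "a" []                  -- no attributes required on the right
  , anchor ~== TagOpen "a" [("class", "ext")]  -- a subset of the attributes is enough
  , anchor ~/= TagOpen "a" [("class", "int")]  -- a differing attribute value fails the match
  , anchor ~== TagOpen "" []                   -- a blank name matches any open tag
  , anchor ~/= TagClose "a"                    -- a different kind of tag never matches
  , anchor ~== "<a>"                           -- String form of the pattern
  ]

main :: IO ()
main = print (and checks)
```

The pattern always goes on the right: attributes listed there must be present on the left-hand tag, while extra attributes on the left are ignored.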