tagsoup-0.14: Parsing and extracting information from (possibly malformed) HTML/XML documents

Safe HaskellSafe
LanguageHaskell98

Text.HTML.TagSoup

Contents

Description

This module is for working with HTML/XML. It deals with both well-formed XML and malformed HTML from the web. It features:

  • A lazy parser, based on the HTML 5 specification - see parseTags.
  • A renderer that can write out HTML/XML - see renderTags.
  • Utilities for extracting information from a document - see ~==, sections and partitions.

The standard practice is to parse a String to [Tag String] using parseTags, then operate upon it to extract the necessary information.

Synopsis

Data structures and parsing

data Tag str Source

A single HTML element. A whole document is represented by a list of Tag. There is no requirement for TagOpen and TagClose to match.

Constructors

TagOpen str [Attribute str]

An open tag with Attributes in their original order

TagClose str

A closing tag

TagText str

A text node, guaranteed not to be the empty string

TagComment str

A comment

TagWarning str

Meta: A syntax error in the input file

TagPosition !Row !Column

Meta: The position of a parsed element

Instances

Functor Tag Source 
Eq str => Eq (Tag str) Source 
Data str => Data (Tag str) Source 
Ord str => Ord (Tag str) Source 
Show str => Show (Tag str) Source 
StringLike str => TagRep (Tag str) Source 

type Row = Int Source

The row/line of a position, starting at 1

type Column = Int Source

The column of a position, starting at 1

type Attribute str = (str, str) Source

An HTML attribute id="name" generates ("id","name")

parseTags :: StringLike str => str -> [Tag str] Source

Parse a string to a list of tags, using an HTML 5 compliant parser.

parseTags "<hello>my&amp;</world>" == [TagOpen "hello" [],TagText "my&",TagClose "world"]

parseTagsOptions :: StringLike str => ParseOptions str -> str -> [Tag str] Source

Parse a string to a list of tags, using settings supplied by the ParseOptions parameter, eg. to output position information:

parseTagsOptions parseOptions{optTagPosition = True} "<hello>my&amp;</world>" ==
   [TagPosition 1 1,TagOpen "hello" [],TagPosition 1 8,TagText "my&",TagPosition 1 15,TagClose "world"]

data ParseOptions str Source

These options control how parseTags works. The ParseOptions type is usually generated by one of parseOptions, parseOptionsFast or parseOptionsEntities, then selected fields may be overriden.

The options optTagPosition and optTagWarning specify whether to generate TagPosition or TagWarning elements respectively. Usually these options should be set to False to simplify future stages, unless you rely on position information or want to give malformed HTML messages to the end user.

The options optEntityData and optEntityAttrib control how entities, for example &nbsp; are handled. Both take a string, and a boolean, where True indicates that the entity ended with a semi-colon ;. Inside normal text optEntityData will be called, and the results will be inserted in the tag stream. Inside a tag attribute optEntityAttrib will be called, and the first component of the result will be used in the attribute, and the second component will be appended after the TagOpen value (usually the second component is []). As an example, to not decode any entities, pass:

parseOptions
    {optEntityData=\(str,b) -> [TagText $ "&" ++ str ++ [';' | b]]
    ,optEntityAttrib\(str,b) -> ("&" ++ str ++ [';' | b], [])

Constructors

ParseOptions 

Fields

optTagPosition :: Bool

Should TagPosition values be given before some items (default=False,fast=False).

optTagWarning :: Bool

Should TagWarning values be given (default=False,fast=False)

optEntityData :: (str, Bool) -> [Tag str]

How to lookup an entity (Bool = has ending ';')

optEntityAttrib :: (str, Bool) -> (str, [Tag str])

How to lookup an entity in an attribute (Bool = has ending ';'?)

optTagTextMerge :: Bool

Require no adjacent TagText values (default=True,fast=False)

parseOptions :: StringLike str => ParseOptions str Source

The default parse options value, described in ParseOptions. Equivalent to parseOptionsEntities lookupEntity.

parseOptionsFast :: StringLike str => ParseOptions str Source

A ParseOptions structure optimised for speed, following the fast options.

parseOptionsEntities :: StringLike str => (str -> Maybe str) -> ParseOptions str Source

A ParseOptions structure using a custom function to lookup attributes. Any attribute that is not found will be left intact, and a TagWarning given (if optTagWarning is set).

If you do not want to resolve any entities, simpliy pass const Nothing for the lookup function.

renderTags :: StringLike str => [Tag str] -> str Source

Show a list of tags, as they might have been parsed, using the default settings given in RenderOptions.

renderTags [TagOpen "hello" [],TagText "my&",TagClose "world"] == "<hello>my&amp;</world>"

renderTagsOptions :: StringLike str => RenderOptions str -> [Tag str] -> str Source

Show a list of tags using settings supplied by the RenderOptions parameter, eg. to avoid escaping any characters one could do:

renderTagsOptions renderOptions{optEscape = id} [TagText "my&"] == "my&"

escapeHTML :: StringLike str => str -> str Source

Replace the four characters &"<> with their HTML entities (escapeXML lifted to StringLike).

data RenderOptions str Source

These options control how renderTags works.

The strange quirk of only minimizing <br> tags is due to Internet Explorer treating <br></br> as <br><br>.

Constructors

RenderOptions 

Fields

optEscape :: str -> str

Escape a piece of text (default = escape the four characters &"<>)

optMinimize :: str -> Bool

Minimise <b></b> -> <b/> (default = minimise only <br> tags)

optRawTag :: str -> Bool

Should a tag be output with no escaping (default = true only for script)

renderOptions :: StringLike str => RenderOptions str Source

The default render options value, described in RenderOptions.

canonicalizeTags :: StringLike str => [Tag str] -> [Tag str] Source

Turns all tag names and attributes to lower case and converts DOCTYPE to upper case.

Tag identification

isTagOpen :: Tag str -> Bool Source

Test if a Tag is a TagOpen

isTagClose :: Tag str -> Bool Source

Test if a Tag is a TagClose

isTagText :: Tag str -> Bool Source

Test if a Tag is a TagText

isTagWarning :: Tag str -> Bool Source

Test if a Tag is a TagWarning

isTagPosition :: Tag str -> Bool Source

Test if a Tag is a TagPosition

isTagOpenName :: Eq str => str -> Tag str -> Bool Source

Returns True if the Tag is TagOpen and matches the given name

isTagCloseName :: Eq str => str -> Tag str -> Bool Source

Returns True if the Tag is TagClose and matches the given name

isTagComment :: Tag str -> Bool Source

Test if a Tag is a TagComment

Extraction

fromTagText :: Show str => Tag str -> str Source

Extract the string from within TagText, crashes if not a TagText

fromAttrib :: (Show str, Eq str, StringLike str) => str -> Tag str -> str Source

Extract an attribute, crashes if not a TagOpen. Returns "" if no attribute present.

maybeTagText :: Tag str -> Maybe str Source

Extract the string from within TagText, otherwise Nothing

maybeTagWarning :: Tag str -> Maybe str Source

Extract the string from within TagWarning, otherwise Nothing

innerText :: StringLike str => [Tag str] -> str Source

Extract all text content from tags (similar to Verbatim found in HaXml)

Utility

sections :: (a -> Bool) -> [a] -> [[a]] Source

This function takes a list, and returns all suffixes whose first item matches the predicate.

partitions :: (a -> Bool) -> [a] -> [[a]] Source

This function is similar to sections, but splits the list so no element appears in any two partitions.

Combinators

class TagRep a where Source

Define a class to allow String's or Tag str's to be used as matches

Methods

toTagRep :: StringLike str => a -> Tag str Source

Convert a value into a Tag.

Instances

(~==) :: (StringLike str, TagRep t) => Tag str -> t -> Bool Source

Performs an inexact match, the first item should be the thing to match. If the second item is a blank string, that is considered to match anything. For example:

(TagText "test" ~== TagText ""    ) == True
(TagText "test" ~== TagText "test") == True
(TagText "test" ~== TagText "soup") == False

For TagOpen missing attributes on the right are allowed.

(~/=) :: (StringLike str, TagRep t) => Tag str -> t -> Bool Source

Negation of ~==