tagsoup-0.11.1: Parsing and extracting information from (possibly malformed) HTML/XML documents

Text.HTML.TagSoup

Description

This module is for working with HTML/XML. It deals with both well-formed XML and malformed HTML from the web. It features:

  • A lazy parser, based on the HTML 5 specification - see parseTags.
  • A renderer that can write out HTML/XML - see renderTags.
  • Utilities for extracting information from a document - see ~==, sections and partitions.

The standard practice is to parse a String to [Tag String] using parseTags, then operate upon it to extract the necessary information.
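As a minimal sketch of that workflow, the following collects the href attribute of every <a> tag in a document. The name linkTargets and the sample markup are illustrative assumptions, not part of the library.

```haskell
import Text.HTML.TagSoup

-- Hypothetical helper: the href of every <a> tag in the input.
-- fromAttrib yields "" for an <a> tag that has no href attribute.
linkTargets :: String -> [String]
linkTargets html =
    [ fromAttrib "href" tag
    | tag <- parseTags html
    , isTagOpenName "a" tag
    ]

main :: IO ()
main = mapM_ putStrLn
    (linkTargets "<p><a href=\"/one\">1</a> and <a href='/two'>2</a></p>")
```

Filtering with isTagOpenName before calling fromAttrib matters here, because fromAttrib crashes on anything that is not a TagOpen.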

Data structures and parsing

data Tag str

A single HTML element. A whole document is represented by a list of Tag. There is no requirement for TagOpen and TagClose to match.

Constructors

TagOpen str [Attribute str]

An open tag with Attributes in their original order

TagClose str

A closing tag

TagText str

A text node, guaranteed not to be the empty string

TagComment str

A comment

TagWarning str

Meta: A syntax error in the input file

TagPosition !Row !Column

Meta: The position of a parsed element

Instances

Functor Tag 
Typeable1 Tag 
Eq str => Eq (Tag str) 
Data str => Data (Tag str) 
Ord str => Ord (Tag str) 
Show str => Show (Tag str) 
StringLike str => TagRep (Tag str) 

type Row = Int

The row/line of a position, starting at 1

type Column = Int

The column of a position, starting at 1

type Attribute str = (str, str)

An HTML attribute id="name" generates ("id","name")

parseTags :: StringLike str => str -> [Tag str]

Parse a string to a list of tags, using an HTML 5 compliant parser.

 parseTags "<hello>my&amp;</world>" == [TagOpen "hello" [],TagText "my&",TagClose "world"]

parseTagsOptions :: StringLike str => ParseOptions str -> str -> [Tag str]

Parse a string to a list of tags, using settings supplied by the ParseOptions parameter, e.g. to output position information:

 parseTagsOptions parseOptions{optTagPosition = True} "<hello>my&amp;</world>" ==
    [TagPosition 1 1,TagOpen "hello" [],TagPosition 1 8,TagText "my&",TagPosition 1 15,TagClose "world"]

data ParseOptions str

These options control how parseTags works.

Constructors

ParseOptions 

Fields

optTagPosition :: Bool

Should TagPosition values be given before some items (default=False,fast=False)

optTagWarning :: Bool

Should TagWarning values be given (default=False,fast=False)

optEntityData :: (str, Bool) -> [Tag str]

How to look up an entity (Bool = has ending ';')

optEntityAttrib :: (str, Bool) -> (str, [Tag str])

How to look up an entity in an attribute (Bool = has ending ';')

optTagTextMerge :: Bool

Require no adjacent TagText values (default=True,fast=False)

parseOptions :: StringLike str => ParseOptions str

The default parse options value, described in ParseOptions.

parseOptionsFast :: StringLike str => ParseOptions str

A ParseOptions structure optimised for speed, following the fast options.
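A small sketch of overriding one field of the default options: here optTagWarning is switched on and only the resulting warning messages are kept. The helper name parseWarnings and the sample input are assumptions made for illustration.

```haskell
import Data.Maybe (mapMaybe)
import Text.HTML.TagSoup

-- Hypothetical helper: parse with warnings enabled and keep only the
-- text carried by the TagWarning values.
parseWarnings :: String -> [String]
parseWarnings =
    mapMaybe maybeTagWarning
        . parseTagsOptions parseOptions{optTagWarning = True}

main :: IO ()
main = mapM_ putStrLn (parseWarnings "<p>unterminated &entity <b")
```

The wording of the messages is decided by the library, so the sketch prints them rather than comparing against fixed strings. The same record-update pattern applies to parseOptionsFast when speed matters more than the default behaviour.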

renderTags :: StringLike str => [Tag str] -> str

Show a list of tags, as they might have been parsed, using the default settings given in RenderOptions.

 renderTags [TagOpen "hello" [],TagText "my&",TagClose "world"] == "<hello>my&amp;</world>"

renderTagsOptions :: StringLike str => RenderOptions str -> [Tag str] -> str

Show a list of tags using settings supplied by the RenderOptions parameter, e.g. to avoid escaping any characters:

 renderTagsOptions renderOptions{optEscape = id} [TagText "my&"] == "my&"

escapeHTML :: StringLike str => str -> str

Replace the four characters &"<> with their HTML entities (the list from xmlEntities).
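One place escapeHTML is handy is when markup is assembled by hand instead of through renderTags. The sketch below wraps arbitrary text in a paragraph; the helper name paragraph is an assumption for illustration.

```haskell
import Text.HTML.TagSoup

-- Hypothetical helper: wrap untrusted text in <p>...</p>, escaping the
-- characters that would otherwise be read as markup.
paragraph :: String -> String
paragraph body = "<p>" ++ escapeHTML body ++ "</p>"

main :: IO ()
main = putStrLn (paragraph "if a < b && c > \"d\"")
```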

data RenderOptions str

These options control how renderTags works.

The strange quirk of only minimizing <br> tags is due to Internet Explorer treating <br></br> as <br><br>.

Constructors

RenderOptions 

Fields

optEscape :: str -> str

Escape a piece of text (default = escape the four characters &"<>)

optMinimize :: str -> Bool

Minimise <b></b> -> <b/> (default = minimise only <br> tags)

renderOptions :: StringLike str => RenderOptions str

The default render options value, described in RenderOptions.
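To see the effect of optMinimize, the following sketch renders the same tags twice: once with the defaults and once with a predicate that accepts every tag name. The helper name renderMinimised is an assumption, not a library function.

```haskell
import Text.HTML.TagSoup

-- Hypothetical helper: minimise every empty element, not only <br>.
renderMinimised :: [Tag String] -> String
renderMinimised = renderTagsOptions renderOptions{optMinimize = const True}

main :: IO ()
main = do
    let tags = [TagOpen "hr" [], TagClose "hr", TagOpen "br" [], TagClose "br"]
    putStrLn (renderTags tags)       -- default: only <br> is minimised
    putStrLn (renderMinimised tags)  -- every empty element is minimised
```

Because optMinimize receives the tag name, a predicate such as (`elem` ["br", "hr", "img"]) would restrict minimising to a chosen set of elements.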

canonicalizeTags :: StringLike str => [Tag str] -> [Tag str]

Turns all tag names and attributes to lower case and converts DOCTYPE to upper case.
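Canonicalising first makes name-based tests insensitive to how the source document was capitalised. A minimal sketch, with the helper name anchors assumed for illustration:

```haskell
import Text.HTML.TagSoup

-- Hypothetical helper: find <a> tags regardless of the capitalisation
-- used in the source document.
anchors :: String -> [Tag String]
anchors = filter (isTagOpenName "a") . canonicalizeTags . parseTags

main :: IO ()
main = print (anchors "<A HREF=\"/x\">x</A> <a href=\"/y\">y</a>")
```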

Tag identification

isTagOpen :: Tag str -> Bool

Test if a Tag is a TagOpen

isTagClose :: Tag str -> Bool

Test if a Tag is a TagClose

isTagText :: Tag str -> Bool

Test if a Tag is a TagText

isTagWarning :: Tag str -> Bool

Test if a Tag is a TagWarning

isTagPosition :: Tag str -> Bool

Test if a Tag is a TagPosition

isTagOpenName :: Eq str => str -> Tag str -> Bool

Returns True if the Tag is TagOpen and matches the given name

isTagCloseName :: Eq str => str -> Tag str -> Bool

Returns True if the Tag is TagClose and matches the given name
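These predicates combine naturally with filter. The sketch below counts text nodes, <li> openers and </li> closers in a document whose tags do not balance; the helper name tally is an assumption for illustration.

```haskell
import Text.HTML.TagSoup

-- Hypothetical helper: (text nodes, <li> openers, </li> closers).
tally :: [Tag String] -> (Int, Int, Int)
tally tags =
    ( length (filter isTagText tags)
    , length (filter (isTagOpenName "li") tags)
    , length (filter (isTagCloseName "li") tags)
    )

main :: IO ()
main = print (tally (parseTags "<ul><li>one<li>two</li></ul>"))
```

The opener and closer counts differ for this input, which reflects that a tag list carries no requirement for TagOpen and TagClose to match.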

Extraction

fromTagText :: Show str => Tag str -> str

Extract the string from within a TagText. Crashes if the tag is not a TagText.

fromAttrib :: (Show str, Eq str, StringLike str) => str -> Tag str -> str

Extract an attribute value. Crashes if the tag is not a TagOpen; returns "" if the attribute is not present.

maybeTagText :: Tag str -> Maybe str

Extract the string from within TagText, otherwise Nothing

maybeTagWarning :: Tag str -> Maybe str

Extract the string from within TagWarning, otherwise Nothing

innerText :: StringLike str => [Tag str] -> str

Extract all text content from tags (similar to Verbatim found in HaXml)
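A short sketch of the extraction functions side by side. Since fromAttrib is partial, it also shows one way to write a total lookup by pattern matching on TagOpen; lookupAttrib is a hypothetical helper, not part of the library.

```haskell
import Text.HTML.TagSoup

-- Hypothetical total alternative to fromAttrib: Nothing for tags that are
-- not TagOpen or that lack the attribute.
lookupAttrib :: String -> Tag String -> Maybe String
lookupAttrib name (TagOpen _ attrs) = lookup name attrs
lookupAttrib _ _ = Nothing

main :: IO ()
main = do
    let tags = parseTags "<p id=\"intro\">Hello <b>world</b>!</p>"
    putStrLn (innerText tags)              -- all text nodes concatenated
    print (map (lookupAttrib "id") tags)   -- Just only for the <p> opener
    print (map maybeTagText tags)          -- Just only for text nodes
```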

Utility

sections :: (a -> Bool) -> [a] -> [[a]]

This function takes a list, and returns all suffixes whose first item matches the predicate.

partitions :: (a -> Bool) -> [a] -> [[a]]

This function is similar to sections, but splits the list so no element appears in any two partitions.
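Both functions work on any list, so the difference is easiest to see on plain numbers. A minimal sketch using even as the predicate:

```haskell
import Text.HTML.TagSoup

main :: IO ()
main = do
    let xs = [1, 2, 3, 4, 5] :: [Int]
    -- Every suffix that starts with an even number; suffixes overlap.
    print (sections even xs)    -- [[2,3,4,5],[4,5]]
    -- Split at each even number; elements before the first match are dropped.
    print (partitions even xs)  -- [[2,3],[4,5]]
```

With tags, the predicate is typically a match on an opening tag, so each section or partition begins at an element of interest.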

Combinators

class TagRep a

A class that allows either a String or a Tag str to be used as a match pattern

Instances

(~==) :: (StringLike str, TagRep t) => Tag str -> t -> Bool

Performs an inexact match. The first item is the tag being tested and the second is the pattern; a blank string in the pattern matches anything. For example:

 (TagText "test" ~== TagText ""    ) == True
 (TagText "test" ~== TagText "test") == True
 (TagText "test" ~== TagText "soup") == False

For TagOpen, attributes missing from the right-hand side are allowed.

(~/=) :: (StringLike str, TagRep t) => Tag str -> t -> Bool

Negation of ~==
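Because String is usable as a pattern, the matching operators pair well with partitions and takeWhile. The sketch below pulls the text out of each list item; listItems is a hypothetical helper, and it assumes plain String literals (with OverloadedStrings enabled the patterns would need a type annotation).

```haskell
import Text.HTML.TagSoup

-- Hypothetical helper: the text of each <li> element, found by splitting
-- the tag list at every <li> opener and reading up to the next </li>.
listItems :: String -> [String]
listItems html =
    [ innerText (takeWhile (~/= "</li>") item)
    | item <- partitions (~== "<li>") (parseTags html)
    ]

main :: IO ()
main = do
    print (listItems "<ul><li>one</li><li>two <b>2</b></li></ul>")
    -- Attributes missing from the pattern on the right are ignored.
    print (TagOpen "a" [("href", "/x"), ("class", "nav")] ~== "<a class='nav'>")
```

Reading up to the next </li> is a simplification: it does not account for nested lists, which would need a proper tree structure.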