mangrove-0.1.0.0: A parser for web documents according to the HTML5 specification.
Copyright: (c) 2020-2021 Sam May
License: MPL-2.0
Maintainer: ag.eitilt@gmail.com
Stability: provisional
Portability: portable
Safe Haskell: Safe-Inferred
Language: Haskell98

Web.Mangrove.Parse.Tree

Description

This module and the internal branch it heads implement the "Tree Construction" section of the HTML document parsing specification, operating over the output of the Web.Mangrove.Parse.Tokenize stage to produce a DOM tree representation of a web page. As this library is still in the early stages of development, the representation produced here is not actually a proper DOM implementation, but instead only stores basic parameters in an equivalent (but less-featured) structure. Nonetheless, it is still enough for basic evaluation and unstyled rendering.

Types

Final

data Tree

DOM: tree

The core concept underlying HTML and related languages: a nested collection of data and metadata marked up according to several broad categories. Values may be easily instantiated as updates to emptyTree.

Constructors

Tree 

Fields

  • node :: Node

    The atomic portion of the tree at the current location.

  • children :: [Tree]

    All parts of the tree nested below the current location.
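
As a rough illustration of the record-update style mentioned above, the following sketch builds a small Tree and counts its nodes. The Node, Tree and emptyTree definitions are simplified local stand-ins, not the library's own declarations: Node is cut down to two constructors, Text carries a String rather than a Text, and the stand-in emptyTree is assumed to be a childless Document. The helper names leaf, sample and size are hypothetical.

```haskell
-- Simplified stand-ins for the documented types (assumptions for illustration).
data Node
    = Text String   -- the documented constructor carries Data.Text instead
    | Document      -- the documented constructor also carries a QuirksMode
    deriving (Eq, Show)

data Tree = Tree
    { node :: Node
    , children :: [Tree]
    } deriving (Eq, Show)

-- Hypothetical default: a root node with no children.
emptyTree :: Tree
emptyTree = Tree { node = Document, children = [] }

-- Build a leaf by overriding only the 'node' field of the default.
leaf :: String -> Tree
leaf s = emptyTree { node = Text s }

-- A root with two text children, again built via record update.
sample :: Tree
sample = emptyTree { children = [leaf "Hello", leaf "world"] }

-- Count every node in the tree, including the root.
size :: Tree -> Int
size t = 1 + sum (map size (children t))
```

Only the fields that differ from the default need to be named, which is the convenience the emptyTree description points to.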

Instances

Eq Tree

Defined in Web.Willow.DOM

Methods

(==) :: Tree -> Tree -> Bool

(/=) :: Tree -> Tree -> Bool

Read Tree

Defined in Web.Willow.DOM

Show Tree

Defined in Web.Willow.DOM

Methods

showsPrec :: Int -> Tree -> ShowS

show :: Tree -> String

showList :: [Tree] -> ShowS

data Node

DOM: node

The sum type of all different classes of behaviour a particular point of data may fill.

Constructors

Text Text

DOM: Text

A simple character string to be rendered to the output or to be processed further, according to which Elements enclose it.

Comment Text

DOM: Comment

An author's aside, not intended to be shown to the end user.

DocumentType DocumentTypeParams

DOM: DocumentType

Largely vestigial in HTML5, but used in previous versions and related languages to specify the semantics of Elements used in the document.

Element ElementParams

DOM: Element

Markup instructions directing the behaviour or classifying a portion of the document's content.

Attribute AttributeParams

DOM: Attr

Metadata allowing finer customization and description of the heavier Elements.

DocumentFragment

DOM: DocumentFragment

Like Document, but requiring less precise structure in its children and generally only containing a small slice of a larger document.

Document QuirksMode

DOM: Document

The root of a Tree, typically imposing a principled structure.
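
To make the role of each constructor more concrete, here is a minimal sketch that pattern-matches on Node to collect the text of a tree while skipping comments. All types are local stand-ins that mirror the constructor names listed above; the parameter types are reduced to placeholders (ElementParams holding only a tag name is an assumption), String replaces Text, and textContent is a hypothetical helper rather than a library function.

```haskell
-- Stand-ins mirroring the documented constructors (assumptions for illustration).
data QuirksMode = NoQuirks | LimitedQuirks | FullQuirks deriving (Eq, Show)

newtype ElementParams = ElementParams String deriving (Eq, Show) -- hypothetical: tag name only
data DocumentTypeParams = DocumentTypeParams deriving (Eq, Show) -- placeholder
data AttributeParams = AttributeParams deriving (Eq, Show)       -- placeholder

data Node
    = Text String
    | Comment String
    | DocumentType DocumentTypeParams
    | Element ElementParams
    | Attribute AttributeParams
    | DocumentFragment
    | Document QuirksMode
    deriving (Eq, Show)

data Tree = Tree { node :: Node, children :: [Tree] } deriving (Eq, Show)

-- Concatenate all Text nodes in document order; comments contribute nothing,
-- and every other node only passes through the text of its children.
textContent :: Tree -> String
textContent (Tree (Text s) cs) = s ++ concatMap textContent cs
textContent (Tree (Comment _) _) = ""
textContent (Tree _ cs) = concatMap textContent cs

example :: Tree
example = Tree (Document NoQuirks)
    [ Tree (Element (ElementParams "p"))
        [ Tree (Text "Hello, ") []
        , Tree (Comment "an aside") []
        , Tree (Text "world") []
        ]
    ]
```

Under these assumptions, textContent example evaluates to "Hello, world".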

Instances

Eq Node

Defined in Web.Willow.DOM

Methods

(==) :: Node -> Node -> Bool

(/=) :: Node -> Node -> Bool

Read Node

Defined in Web.Willow.DOM

Show Node

Defined in Web.Willow.DOM

Methods

showsPrec :: Int -> Node -> ShowS

show :: Node -> String

showList :: [Node] -> ShowS

data QuirksMode

Through the long history of HTML browsers, many unique and/or buggy behaviours have become enshrined due to the simple fact that website authors used them. As the standards and the parse engines have continued to develop, three separate degrees of emulation have emerged for that backwards compatibility.

Constructors

NoQuirks

DOM: no-quirks mode

Fully compliant with the modern standard.

LimitedQuirks

DOM: limited-quirks mode

Largely compliant with the standard, except for a couple of height calculations.

FullQuirks

DOM: quirks mode

Backwards compatibility with 1990s-era technology.

Intermediate

data Patch

The atomic, self-contained instruction set describing how to build a final document tree.

Instances

Eq Patch

Defined in Web.Mangrove.Parse.Tree.Patch

Methods

(==) :: Patch -> Patch -> Bool

(/=) :: Patch -> Patch -> Bool

Read Patch

Defined in Web.Mangrove.Parse.Tree.Patch

Show Patch

Defined in Web.Mangrove.Parse.Tree.Patch

Methods

showsPrec :: Int -> Patch -> ShowS

show :: Patch -> String

showList :: [Patch] -> ShowS

data TreeState

The collection of data required to extract a list of semantic atoms from a binary document stream. Values may be easily instantiated as updates to defaultTreeState.

data Encoding

Encoding: encoding

All character encoding schemes supported by the HTML standard, defined as a bidirectional map between characters and binary sequences. Utf8 is strongly encouraged for new content (including all encoding purposes), but the others are retained for compatibility with existing pages.

Note that, to one degree or another, none of these mappings is a total function (some inputs have no defined output), and that no guarantee is made that a mapping round-trips.

Constructors

Utf8

The UTF-8 encoding for Unicode.

Utf16be

The UTF-16 encoding for Unicode, in big endian order.

No encoder is provided for this scheme.

Utf16le

The UTF-16 encoding for Unicode, in little endian order.

No encoder is provided for this scheme.

Big5

Big5, primarily covering traditional Chinese characters.

EucJp

EUC-JP, primarily covering Japanese as the union of JIS-0208 and JIS-0212.

EucKr

EUC-KR, primarily covering Hangul.

Gb18030

The GB18030-2005 extension to GBK, with one tweak for web compatibility, primarily covering both forms of Chinese characters.

Note that this encoding also includes a large number of four-byte sequences which aren't listed in the linked visualization.

Gbk

GBK, primarily covering simplified Chinese characters.

In practice, this is just Gb18030 with a restricted set of encodable characters; the decoder is identical.

Ibm866

DOS and OS/2 code page for Cyrillic characters.

Iso2022Jp

A Japanese-focused implementation of the ISO 2022 meta-encoding, including both JIS-0208 and halfwidth katakana.

Iso8859_2

Latin-2 (Central European).

Iso8859_3

Latin-3 (South European and Esperanto).

Iso8859_4

Latin-4 (North European).

Iso8859_5

Latin/Cyrillic.

Iso8859_6

Latin/Arabic.

Iso8859_7

Latin/Greek (modern monotonic).

Iso8859_8

Latin/Hebrew (visual order).

Iso8859_8i

Latin/Hebrew (logical order).

Iso8859_10

Latin-6 (Nordic).

Iso8859_13

Latin-7 (Baltic Rim).

Iso8859_14

Latin-8 (Celtic).

Iso8859_15

Latin-9 (revision of ISO 8859-1 Latin-1, Western European).

Iso8859_16

Latin-10 (South-Eastern European).

Koi8R

KOI-8 specialized for Russian Cyrillic.

Koi8U

KOI-8 specialized for Ukrainian Cyrillic.

Macintosh

Mac OS Roman.

MacintoshCyrillic

Mac OS Cyrillic (as of Mac OS 9.0).

ShiftJis

The Windows variant (code page 932) of Shift JIS.

Windows874

ISO 8859-11 Latin/Thai with Windows extensions in the C1 control character slots.

Note that this encoding is always used instead of pure Latin/Thai.

Windows1250

The Windows extension and rearrangement of ISO 8859-2 Latin-2.

Windows1251

Windows Cyrillic.

Windows1252

The Windows extension of ISO 8859-1 Latin-1, replacing most of the C1 control characters with printable glyphs.

Note that this encoding is always used instead of pure Latin-1.

Windows1253

Windows Greek (modern monotonic).

Windows1254

The Windows extension of ISO 8859-9 Latin-5 (Turkish), replacing most of the C1 control characters with printable glyphs.

Note that this encoding is always used instead of pure Latin-5.

Windows1255

The Windows extension and rearrangement of ISO 8859-8 Latin/Hebrew.

Windows1256

Windows Arabic.

Windows1257

Windows Baltic.

Windows1258

Windows Vietnamese.

Replacement

The input is reduced to a single \xFFFD replacement character.

No encoder is provided for this scheme.

UserDefined

Non-ASCII bytes (\x80 through \xFF) are mapped to a portion of the Unicode Private Use Area (\xF780 through \xF7FF).
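
The byte-to-character arithmetic described for UserDefined can be sketched as a single-byte function. This is a minimal illustration, not the library's decoder: decodeUserDefinedByte is a hypothetical name, and the pass-through of ASCII bytes follows the x-user-defined decoder in the WHATWG Encoding Standard rather than anything stated on this page.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- Decode one byte: ASCII bytes map to themselves, while \x80 through \xFF
-- are shifted into the Private Use Area range \xF780 through \xF7FF.
decodeUserDefinedByte :: Word8 -> Char
decodeUserDefinedByte b
    | b < 0x80 = chr (fromIntegral b)
    | otherwise = chr (0xF780 + fromIntegral b - 0x80)
```

For instance, the byte \x80 lands on \xF780 and \xFF lands on \xF7FF, matching the range given above.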

Instances

Bounded Encoding

Defined in Web.Willow.Common.Encoding.Common

Enum Encoding

Defined in Web.Willow.Common.Encoding.Common

Eq Encoding

Defined in Web.Willow.Common.Encoding.Common

Ord Encoding

Defined in Web.Willow.Common.Encoding.Common

Read Encoding

Defined in Web.Willow.Common.Encoding.Common

Show Encoding

Defined in Web.Willow.Common.Encoding.Common

Hashable Encoding

Defined in Web.Willow.Common.Encoding.Common

Methods

hashWithSalt :: Int -> Encoding -> Int

hash :: Encoding -> Int

type NodeIndex = Natural

A type alias for the identifier that uniquely distinguishes each element in the document tree; identifiers are assigned in a rough tree order.

The specification assumes a reference-based and mutable memory model, where each element in, e.g., the stack of open elements not only describes the shape of the node, but also is distinct from all other nodes with the same data. Haskell's memory is much less accessible and more likely to be shared, so an extra datapoint needs to be carried along.
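
The need for that extra datapoint can be seen in a small sketch. Assume a stack of open elements holding two nested elements with identical data: comparing by data alone cannot tell them apart, while the index can. ElementParams here is a hypothetical stand-in holding only a tag name, and OpenElements, removeByIndex, removeByData and stack are illustrative names that do not come from the library.

```haskell
import Numeric.Natural (Natural)

type NodeIndex = Natural

-- Hypothetical stand-in: only the tag name is kept.
newtype ElementParams = ElementParams String deriving (Eq, Show)

-- A stack of open elements, innermost first, each tagged with its identity.
type OpenElements = [(NodeIndex, ElementParams)]

-- Remove exactly the entry with the given identity.
removeByIndex :: NodeIndex -> OpenElements -> OpenElements
removeByIndex i = filter ((/= i) . fst)

-- Remove by data alone: every entry with equal data is dropped.
removeByData :: ElementParams -> OpenElements -> OpenElements
removeByData p = filter ((/= p) . snd)

-- Two nested <b> elements inside a <p>.
stack :: OpenElements
stack =
    [ (3, ElementParams "b")
    , (2, ElementParams "b")
    , (1, ElementParams "p")
    ]
```

With this stack, removeByIndex 3 leaves the outer "b" in place, whereas removeByData (ElementParams "b") drops both.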

data ElementParams

DOM: Element

The collection of metadata identifying and describing a markup tag used to associate text or other data with its broader role in the document, or to indicate a preferred rendering. Values may be easily instantiated as updates to emptyElementParams.

Constructors

ElementParams 

Fields

emptyElementParams :: ElementParams

A sane default collection for easy record initialization.

Initialization

defaultTreeState :: TreeState

The collection of data which results in behaviour according to the "initially" instructions in the HTML tree construction algorithm.

treeEncoding :: Either SnifferEnvironment (Maybe Encoding) -> TreeState -> TreeState

Specify the encoding scheme a given parse environment should use to read from the binary document stream. Note that this will always use the initial state for the respective decoder; intermediate states as returned by decodeStep are not supported.

treeFragment

Arguments

:: ElementParams

HTML: context element

The node wrapping -- in one way or another -- the embedded document fragment.

-> [(NodeIndex, ElementParams)]

The ancestors of the context element, most immediate first.

-> Maybe QuirksMode

The degree of backwards compatibility used in the node document of the context element, if it can be determined.

-> Maybe Bool

Whether the node document of the context element has been parsed in a way which would require scripting to be enabled (Just True) or disabled (Just False).

-> TreeState 
-> TreeState 

HTML: fragment parsing algorithm

Transform a given parse environment by adding context for an embedded but separate document fragment. Calling this with an intermediate state returned by treeStep (as opposed to an initial state from defaultTreeState) may result in an unexpected tree structure.

treeInIFrame :: Bool -> TreeState -> TreeState

Specify whether the given parse environment should be treated as if the document were contained within the srcdoc attribute of an <iframe> element (False by default).
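
Because each initialization function ends in TreeState -> TreeState, several of them can be chained with ordinary function composition. The sketch below only demonstrates that pattern: TreeState is opaque in this module, so the record fields stateEncoding and stateInIFrame, the cut-down Encoding and SnifferEnvironment types, the Right Nothing default, and the function bodies are all hypothetical stand-ins. Only the type signatures and the False default for the iframe flag are taken from the documentation above.

```haskell
-- Hypothetical stand-ins; the real types are richer and TreeState is opaque.
data Encoding = Utf8 | Windows1252 deriving (Eq, Show)
data SnifferEnvironment = SnifferEnvironment deriving (Eq, Show)

data TreeState = TreeState
    { stateEncoding :: Either SnifferEnvironment (Maybe Encoding) -- assumed field
    , stateInIFrame :: Bool                                       -- assumed field
    } deriving (Eq, Show)

-- Assumed defaults, apart from the documented False for the iframe flag.
defaultTreeState :: TreeState
defaultTreeState = TreeState { stateEncoding = Right Nothing, stateInIFrame = False }

-- Same signatures as documented; the bodies are illustrative only.
treeEncoding :: Either SnifferEnvironment (Maybe Encoding) -> TreeState -> TreeState
treeEncoding enc st = st { stateEncoding = enc }

treeInIFrame :: Bool -> TreeState -> TreeState
treeInIFrame flag st = st { stateInIFrame = flag }

-- Chain the setters with (.) and apply the result to the default state.
configured :: TreeState
configured = (treeInIFrame True . treeEncoding (Right (Just Utf8))) defaultTreeState
```

With the real library, the same composition should apply unchanged to the exported functions, since only their signatures are relied upon here.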

Transformations

tree :: TreeState -> ByteString -> ([Patch], TreeState)

HTML: tree construction

Given a starting environment, transform a binary document stream into a hierarchical markup tree. If the parse fails, returns an empty tree (a Document node with no children).

treeStep :: TreeState -> ByteString -> ([Patch], TreeState, ByteString)

Parse a minimal number of tokens from a binary document stream, into a state-independent sequence of folding instructions. Returns all data required to seamlessly resume parsing.
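
One way to use a function of this shape is to drive it in a loop, feeding the leftover bytes and the updated state back in until the input is exhausted. The following is a minimal sketch of such a driver, written against any function with the same shape as treeStep so that it stays self-contained; driveSteps and toyStep are hypothetical names, and toyStep is a toy stand-in that consumes one byte per call rather than a real parser. The "no progress" guard is an added safety assumption, since this page does not say how treeStep behaves when it cannot consume anything.

```haskell
import qualified Data.ByteString as BS
import Data.Word (Word8)

-- Repeatedly apply a treeStep-shaped function, accumulating the emitted
-- patches in order, until the input is empty or a step consumes nothing.
driveSteps
    :: (state -> BS.ByteString -> ([patch], state, BS.ByteString))
    -> state
    -> BS.ByteString
    -> ([patch], state)
driveSteps step = go []
  where
    go acc st input
        | BS.null input = (concat (reverse acc), st)
        -- Stop rather than loop forever if the step made no progress.
        | BS.length rest >= BS.length input = (concat (reverse (ps : acc)), st')
        | otherwise = go (ps : acc) st' rest
      where
        (ps, st', rest) = step st input

-- Toy stand-in for a step function: emit one byte as a "patch" and count it.
toyStep :: Int -> BS.ByteString -> ([Word8], Int, BS.ByteString)
toyStep n bs = case BS.uncons bs of
    Nothing -> ([], n, bs)
    Just (b, rest) -> ([b], n + 1, rest)
```

Keeping the state and leftover input explicit is what allows parsing to resume seamlessly, for example when further bytes arrive from a network stream.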

finalizeTree :: [Patch] -> TreeState -> Tree

Explicitly indicate that the input stream will not contain any further bytes, and perform any finalization processing based on that.